Preface to the 1996 Edition


The following are natural goals for an aspiring textbook writer of a book like 
this one: 


1. It should be attractive to first year graduate students from a variety of 
engineering and scientific disciplines. 


2. The text should be self-contained, assuming only a good undergraduate 
background in linear algebra. 


3. The students should learn the mathematical basis of the field, as well as 
how to build or find good numerical software. 


4. Students should acquire practical knowledge for solving real problems 
efficiently. In particular, they should know what the state-of-the-art 
techniques are in each area, or when to look for them and where to find 
them. 


5. It should all fit in one semester, since that is what most students have 
available for this subject. 


The fifth goal is perhaps the hardest to manage. The first edition of these 
notes was 215 pages, which did fit into one ambitious semester. This edition 
has more than doubled in length, which is certainly too much material for 
even a heroic semester. The new material reflects the growth in research in 
the field in the last few years, including my own. It also reflects requests from 
colleagues for sections on certain topics that were treated lightly or not at all 
in the first edition. Notable additions include 


• a class homepage with Matlab source code for examples and homework
problems in the text, and pointers to other on-line software and
textbooks;

• more pointers in the text to software for all problems, including a
summary of available software for solving sparse linear systems using direct
methods;

• a new chapter on Krylov subspace methods for eigenvalue problems;

• a section on domain decomposition, including both overlapping and
nonoverlapping methods;

• sections on “relative perturbation theory” and corresponding high-accuracy
algorithms for eigenproblems, like Jacobi and qd;

• more detailed performance comparisons of competing least squares and
symmetric eigenvalue algorithms; and

• new homework problems, including many contributed by Zhaojun Bai.


A reasonable one-semester curriculum could consist of the following
chapters and sections:

• all of Chapter 1;

• Chapter 2, excluding sections 2.2.1, 2.4.3, 2.5, 2.6.3, and 2.6.4;

• Chapter 3, excluding section 3.5;

• Chapter 4, up to and including section 4.4.5;

• Chapter 5, excluding sections 5.2.1, 5.3.5, 5.4, and 5.5; and

• Chapter 6, excluding sections 6.3.3, 6.5.5, 6.5.6, 6.6.6, 6.7.2, 6.7.3, 6.7.4,
6.8, 6.9.2, and 6.10.


Homework problems are marked Easy, Medium, or Hard, according to their
difficulty. Problems involving significant amounts of programming are marked
“programming.”

Many people have helped contribute to this text, notably Zhaojun Bai, 
Alan Edelman, Velvel Kahan, Richard Lehoucq, Beresford Parlett, and many 
anonymous referees, all of whom made detailed comments on various parts of 
the text. Table 2.2 is taken from the PhD thesis of my student Xiaoye Li. Alan 
Edelman at MIT and Martin Gutknecht at ETH Zurich provided hospitable 
surroundings at their institutions while this final edition was being prepared. 
Many students at Courant, Berkeley, Kentucky and MIT have listened to and 
helped debug this material over the years, and deserve thanks. Finally, Kathy 
Yelick has contributed scientific comments, LaTeX consulting, and moral support
over more years than either of us expected this project to take. 


James Demmel 
MIT 
September, 1996 


Introduction 


1.1. Basic Notation 


In this course we will refer frequently to matrices, vectors, and scalars. A
matrix will be denoted by an upper case letter such as $A$, and its $(i,j)$th
element will be denoted by $a_{ij}$. If the matrix is given by an expression such
as $A + B$, we will write $(A + B)_{ij}$. In detailed algorithmic descriptions we
will sometimes write $A(i,j)$ or use the Matlab [182] notation $A(i:j, k:l)$ to
denote the submatrix of $A$ lying in rows $i$ through $j$ and columns $k$ through
$l$. A lower-case letter like $x$ will denote a vector, and its $i$th element will
be written $x_i$. Vectors will almost always be column vectors, which are the
same as matrices with one column. Lower-case Greek letters (and occasionally
lower-case letters) will denote scalars. $\mathbb{R}$ will denote the set of real numbers;
$\mathbb{R}^n$, the set of n-dimensional real vectors; and $\mathbb{R}^{m \times n}$, the set of m-by-n real
matrices. $\mathbb{C}$, $\mathbb{C}^n$, and $\mathbb{C}^{m \times n}$ denote complex numbers, vectors, and matrices,
respectively. Occasionally we will use the shorthand $A^{m \times n}$ to indicate that $A$ is
an m-by-n matrix. $A^T$ will denote the transpose of the matrix $A$: $(A^T)_{ij} = a_{ji}$.
For complex matrices we will also use the conjugate transpose $A^*$: $(A^*)_{ij} = \bar{a}_{ji}$.
$\Re z$ and $\Im z$ will denote the real and imaginary parts of the complex number
$z$, respectively. If $A$ is m-by-n, then $|A|$ is the m-by-n matrix of absolute
values of entries of $A$: $(|A|)_{ij} = |a_{ij}|$. Inequalities like $|A| \le |B|$ are meant
componentwise: $|a_{ij}| \le |b_{ij}|$ for all $i$ and $j$. We will also use this absolute value
notation for vectors: $(|x|)_i = |x_i|$. Ends of proofs will be marked by □, and
ends of examples by ◊. Other notation will be introduced as needed.


1.2. Standard Problems of Numerical Linear Algebra 
We will consider the following standard problems: 


• Linear systems of equations: Solve $Ax = b$. Here A is a given n-by-n
nonsingular real or complex matrix, b is a given column vector with n
entries, and x is a column vector with n entries that we wish to compute.

• Least squares problems: Compute the x that minimizes $\|Ax - b\|_2$. Here
A is m-by-n, b is m-by-1, x is n-by-1, and $\|y\|_2 = \sqrt{\sum_i |y_i|^2}$ is called
the two-norm of the vector y. If m > n so that we have more equations
than unknowns, the system is called overdetermined. In this case we
cannot generally solve $Ax = b$ exactly. If m < n, the system is called
underdetermined, and we will have infinitely many solutions.

• Eigenvalue problems: Given an n-by-n matrix A, find an n-by-1 nonzero
vector x and a scalar $\lambda$ so that $Ax = \lambda x$.

• Singular value problems: Given an m-by-n matrix A, find an n-by-1
nonzero vector x and scalar $\lambda$ so that $A^T A x = \lambda x$. We will see that this
special kind of eigenvalue problem is important enough to merit separate
consideration and algorithms.


We choose to emphasize these standard problems because they arise so 
often in engineering and scientific practice. We will illustrate them throughout 
the book with simple examples drawn from engineering, statistics, and other 
fields. There are also many variations of these standard problems that we will 
consider, such as generalized eigenvalue problems $Ax = \lambda Bx$ (section 4.5) and
“rank-deficient” least squares problems $\min_x \|Ax - b\|_2$, whose solutions are
nonunique because the columns of A are linearly dependent (section 3.5). 

We will learn the importance of exploiting any special structure our problem 
may have. For example, solving an n-by-n linear system costs $\frac{2}{3}n^3$ floating
point operations if we use the most general form of Gaussian elimination. If we
add the information that the system is symmetric and positive definite, we can
save half the work by using another algorithm called Cholesky. If we further
know the matrix is banded with semibandwidth $\sqrt{n}$ (i.e., $a_{ij} = 0$ if $|i - j| > \sqrt{n}$),
then we can reduce the cost further to $O(n^2)$ by using band Cholesky. If we
say quite explicitly that we are trying to solve Poisson’s equation on a square 
using a 5-point difference approximation, which determines the matrix nearly 
uniquely, then by using the multigrid algorithm we can reduce the cost to O(n), 
which is nearly as fast as possible, in the sense that we use just a constant 
amount of work per solution component (section 6.4). 


1.3. General Techniques 
There are several general concepts and techniques that we will use repeatedly: 
1. matrix factorizations; 


2. perturbation theory and condition numbers; 


3. effects of roundoff error on algorithms, including properties of floating 
point arithmetic; 
4. analysis of the speed of an algorithm;


5. engineering numerical software. 


We discuss each of these briefly below. 


1.3.1. Matrix Factorizations 


A factorization of the matrix A is a representation of A as a product of several 
“simpler” matrices, which make the problem at hand easier to solve. We give 
two examples. 


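EXAMPLE 1.1. Suppose that we want to solve $Ax = b$. If A is a lower
triangular matrix,

$$
\begin{bmatrix}
a_{11} & & & \\
a_{21} & a_{22} & & \\
\vdots & \vdots & \ddots & \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}
$$

is easy to solve using forward substitution:

for i = 1 to n
    $x_i = (b_i - \sum_{k=1}^{i-1} a_{ik} x_k) / a_{ii}$
end for
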

An analogous idea, back substitution, works if A is upper triangular. To 
use this to solve a general system Ax = b we need the following matrix factor- 
ization, which is just a restatement of Gaussian elimination. 


THEOREM 1.1. If the n-by-n matrix A is nonsingular, there exists a
permutation matrix P (the identity matrix with its rows permuted), a nonsingular
lower triangular matrix L, and a nonsingular upper triangular matrix U such
that $A = P \cdot L \cdot U$. To solve $Ax = b$, we solve the equivalent system $PLUx = b$
as follows:

$LUx = P^{-1}b = P^T b$ (permute entries of b),

$Ux = L^{-1}(P^T b)$ (forward substitution),

$x = U^{-1}(L^{-1} P^T b)$ (back substitution).


We will prove this theorem in section 2.3. ◊
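
The three steps of Theorem 1.1 can be sketched in a few lines of Python. This is a minimal illustration that assumes the factors are already known; the function names and the representation of P as an index list `perm` (with `(P^T b)[i] = b[perm[i]]`) are choices made for this sketch, not notation from the text.

```python
def forward_substitution(L, b):
    """Solve L y = b for lower triangular L with nonzero diagonal."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][k] * y[k] for k in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def back_substitution(U, y):
    """Solve U x = y for upper triangular U with nonzero diagonal."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x


def solve_plu(perm, L, U, b):
    """Solve A x = b given A = P L U, where (P^T b)[i] = b[perm[i]]."""
    permuted_b = [b[p] for p in perm]          # L U x = P^T b
    y = forward_substitution(L, permuted_b)    # U x = L^{-1} (P^T b)
    return back_substitution(U, y)             # x = U^{-1} (L^{-1} P^T b)


# Hand-built example: A = [[4, 5], [2, 1]] is L U with its two rows swapped.
L = [[1.0, 0.0], [2.0, 1.0]]
U = [[2.0, 1.0], [0.0, 3.0]]
x = solve_plu([1, 0], L, U, [14.0, 4.0])
```

Here `x` satisfies 4·x[0] + 5·x[1] = 14 and 2·x[0] + x[1] = 4. Computing the factors themselves is the subject of section 2.3.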


EXAMPLE 1.2. The Jordan canonical factorization $A = VJV^{-1}$ exhibits the
eigenvalues and eigenvectors of A. Here V is a nonsingular matrix, whose 
columns include the eigenvectors, and J is the Jordan canonical form of A, 
a special triangular matrix with the eigenvalues of A on its diagonal. We 
will learn that it is numerically superior to compute the Schur factorization 
A = UTU*, where U is a unitary matrix (i.e., U’s columns are orthonormal), 
and T is upper triangular with A’s eigenvalues on its diagonal. The Schur form 
T can be computed faster and more accurately than the Jordan form J. We 
discuss the Jordan and Schur factorizations in section 4.2. ◊

1.3.2. Perturbation Theory and Condition Numbers


The answers produced by numerical algorithms are seldom exactly correct. 
There are two sources of error. First, there may be errors in the input data 
to the algorithm, caused by prior calculations or perhaps measurement errors. 
Second, there are errors caused by the algorithm itself, due to approximations 
made within the algorithm. In order to estimate the errors in the computed 
answers from both these sources, we need to understand how much the solution 
of a problem is changed (or perturbed) if the input data is slightly perturbed. 


EXAMPLE 1.3. Let $f(x)$ be a real-valued continuous function of a real variable
$x$. We want to compute $f(x)$, but we do not know $x$ exactly. Suppose instead
that we are given $x + \delta x$ and a bound on $\delta x$. The best that we can do (without
more information) is to compute $f(x + \delta x)$ and to try to bound the absolute
error $|f(x + \delta x) - f(x)|$. We may use a simple linear approximation to $f$ to get
the error bound $f(x + \delta x) \approx f(x) + \delta x f'(x)$, and so the error is
$|f(x + \delta x) - f(x)| \approx |\delta x| \cdot |f'(x)|$. We call $|f'(x)|$ the absolute condition number of $f$ at $x$.
If $|f'(x)|$ is large enough, then the error may be large even if $\delta x$ is small; in
this case we call $f$ ill-conditioned at $x$. ◊


We say absolute condition number because it provides a bound on the
absolute error $|f(x + \delta x) - f(x)|$ given a bound on the absolute change $|\delta x|$ in
the input. We will also often use the following essentially equivalent expression
to bound the error:

$$
\frac{|f(x + \delta x) - f(x)|}{|f(x)|} \approx \frac{|\delta x|}{|x|} \cdot \frac{|f'(x)| \cdot |x|}{|f(x)|}.
$$

This expression bounds the relative error $|f(x + \delta x) - f(x)|/|f(x)|$ as a
multiple of the relative change $|\delta x|/|x|$ in the input. The multiplier, $|f'(x)| \cdot |x|/|f(x)|$,
is called the relative condition number, or often just condition number for short.

The condition number is all that we need to understand how error in the 
input data affects the computed answer: we simply multiply the condition 
number by a bound on the input error to bound the error in the computed 
solution. 

For each problem we consider, we will derive its corresponding condition 
number. 
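
As a small numerical illustration of the relative condition number $|f'(x)| \cdot |x| / |f(x)|$, the following Python sketch evaluates it for two functions. The helper name `relative_condition_number` and the two test functions are assumptions chosen for this example; the derivative is supplied by hand rather than estimated.

```python
import math


def relative_condition_number(f, dfdx, x):
    """Return |f'(x)| * |x| / |f(x)| for a function f with derivative dfdx."""
    return abs(dfdx(x)) * abs(x) / abs(f(x))


# f(x) = sqrt(x): the relative condition number is 1/2 for every x > 0.
kappa_sqrt = relative_condition_number(
    math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0
)

# f(x) = x - 1 near x = 1: the condition number |x| / |x - 1| is large,
# so a small relative change in x causes a large relative change in f(x).
kappa_sub = relative_condition_number(lambda x: x - 1.0, lambda x: 1.0, 1.001)
```

With these inputs `kappa_sqrt` is 0.5, while `kappa_sub` is roughly 1000: a relative input error of $10^{-6}$ in $x = 1.001$ can become a relative error of about $10^{-3}$ in $x - 1$.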


1.3.3. Effects of Roundoff Error on Algorithms 


To continue our analysis of the error caused by the algorithm itself, we need 
to study the effect of roundoff error in the arithmetic, or simply roundoff for 
short. We will do so by using a property possessed by most good algorithms: 
backward stability. We define it as follows. 


If alg(x) is our algorithm for $f(x)$, including the effects of roundoff,
we call alg(x) a backward stable algorithm for $f(x)$ if for all $x$ there
is a “small” $\delta x$ such that alg(x) $= f(x + \delta x)$. $\delta x$ is called the
backward error. Informally, we say that we get the exact answer
($f(x + \delta x)$) for a slightly wrong problem ($x + \delta x$).


This implies that we may bound the error as 


error $= |\mathrm{alg}(x) - f(x)| = |f(x + \delta x) - f(x)| \approx |f'(x)| \cdot |\delta x|$,


the product of the absolute condition number $|f'(x)|$ and the magnitude of
the backward error $|\delta x|$. Thus, if alg(·) is backward stable, $|\delta x|$ is always
small, so the error will be small unless the absolute condition number is large.
Thus, backward stability is a desirable property for an algorithm, and most 
of the algorithms that we present will be backward stable. Combined with 
the corresponding condition numbers, we will have error bounds for all our 
computed solutions. 

Proving that an algorithm is backward stable requires knowledge of the 
roundoff error of the basic floating point operations of the machine and how 
these errors propagate through an algorithm. This is discussed in section 1.5. 


1.3.4. Analyzing the Speed of Algorithms 


In choosing an algorithm to solve a problem, one must of course consider 
its speed (which is also called performance) as well as its backward stability. 
There are several ways to estimate speed. Given a particular problem instance, 
a particular implementation of an algorithm, and a particular computer, one 
can of course simply run the algorithm and see how long it takes. This may 
be difficult or time consuming, so we often want simpler estimates. Indeed, we 
typically want to estimate how long a particular algorithm would take before 
implementing it. 

The traditional way to estimate the time an algorithm takes is to count 
the flops, or floating point operations, that it performs. We will do this for 
all the algorithms we present. However, this is often a misleading time es- 
timate on modern computer architectures, because it can take significantly 
more time to move the data inside the computer to the place where it is to 
be multiplied, say, than it does to actually perform the multiplication. This 
is especially true on parallel computers but also is true on conventional ma- 
chines such as workstations and PCs. For example, matrix multiplication on 
the IBM RS6000/590 workstation can be sped up from 65 Mflops (millions of 
floating point operations per second) to 240 Mflops, nearly four times faster, 
by judiciously reordering the operations of the standard algorithm (and using 
the correct compiler optimizations). We discuss this further in section 2.6. 

If an algorithm is iterative, i.e., produces a series of approximations con- 
verging to the answer rather than stopping after a fixed number of steps, then 
we must ask how many steps are needed to decrease the error to a toler- 
able level. To do this, we need to decide if the convergence is linear (i.e., 
the error decreases by a constant factor 0 < c < 1 at each step so that
$|\mathrm{error}_i| \le c \cdot |\mathrm{error}_{i-1}|$) or faster, such as quadratic ($|\mathrm{error}_i| \le c \cdot |\mathrm{error}_{i-1}|^2$). If
two algorithms are both linear, we can ask which has the smaller constant c. 
Iterative linear equation solvers and their convergence analysis are the subject 
of Chapter 6. 
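
To make the difference between linear and quadratic convergence concrete, the following Python sketch counts how many steps each error bound needs to fall below a tolerance. The starting error 0.5, the constant c = 0.5, and the tolerance $10^{-12}$ are arbitrary values assumed for illustration, and the helper `steps_to_tolerance` is a hypothetical name.

```python
def steps_to_tolerance(update, error0, tol, max_steps=10_000):
    """Count steps until the error bound is at most tol."""
    error, steps = error0, 0
    while error > tol and steps < max_steps:
        error = update(error)
        steps += 1
    return steps


c = 0.5
# Linear convergence: error_i = c * error_{i-1}
linear_steps = steps_to_tolerance(lambda e: c * e, 0.5, 1e-12)
# Quadratic convergence: error_i = c * error_{i-1}**2
quadratic_steps = steps_to_tolerance(lambda e: c * e * e, 0.5, 1e-12)
```

With these values the linear bound needs 39 steps and the quadratic bound needs 5, since the number of correct digits roughly doubles at each quadratic step.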


1.3.5. Engineering Numerical Software 


Three main issues in designing or choosing a piece of numerical software are 
ease of use, reliability, and speed. Most of the algorithms covered in this course 
have already been carefully programmed with these three issues in mind. If 
some of this existing software can solve your problem, its ease of use may well 
outweigh any other considerations such as speed. Indeed, if you need only to 
solve your problem once or a few times, it is often easier to use general purpose 
software written by experts than to write your own more specialized program. 

There are three programming paradigms for exploiting other experts’ soft- 
ware. The first paradigm is the traditional software library, consisting of a 
collection of subroutines for solving a fixed set of problems, such as solving 
linear systems, finding eigenvalues, and so on. In particular, we will discuss 
the LAPACK library [10], a state-of-the-art collection of routines available in 
Fortran and C. This library, and many others like it, are freely available in 
the public domain; see NETLIB on the World Wide Web.¹ LAPACK provides
reliability and high speed (for example, making careful use of matrix multipli- 
cation, as described above) but requires careful attention to data structures 
and calling sequences on the part of the user. We will provide pointers to such 
software throughout the text. 

The second programming paradigm provides a much easier-to-use environ- 
ment than libraries like LAPACK, but at the cost of some performance. This 
paradigm is provided by the commercial system Matlab [182], among others. 
Matlab provides a simple interactive programming environment where all
variables represent matrices (scalars are just 1-by-1 matrices), and most linear
algebra operations are available as built-in functions. For example, “C = A*B”
stores the product of matrices A and B in C, and “A = inv(B)” stores the
inverse of matrix B in A. It is easy to quickly prototype algorithms in Matlab 
and to see how they work. But since Matlab makes a number of algorith- 
mic decisions automatically for the user, it may perform more slowly than a 
carefully chosen library routine. 

The third programming paradigm is that of templates, or recipes for as- 
sembling complicated algorithms out of simpler building blocks. Templates are 
useful when there are a large number of ways to construct an algorithm but no 
simple rule for choosing the best construction for a particular input problem; 
therefore, much of the construction must be left to the user. An example of
this may be found in Templates for the Solution of Linear Systems: Building
Blocks for Iterative Methods [24]; a similar set of templates for eigenproblems
is currently under construction.


¹Recall that we abbreviate the URL prefix http://www.netlib.org to NETLIB in the text.


1.4. Example: Polynomial Evaluation 


We illustrate the ideas of perturbation theory, condition numbers, backward
stability, and roundoff error analysis with the example of polynomial evaluation:

$$
p(x) = \sum_{i=0}^{d} a_i x^i.
$$

Horner’s rule for polynomial evaluation is


$p = a_d$
for i = d − 1 down to 0
    $p = x \cdot p + a_i$
end for


Let us apply this to $p(x) = (x - 2)^9 = x^9 - 18x^8 + 144x^7 - 672x^6 + 2016x^5 -
4032x^4 + 5376x^3 - 4608x^2 + 2304x - 512$. In the bottom of Figure 1.1, we see
that near the zero x = 2 the value of p(x) computed by Horner’s rule is quite 
unpredictable and may justifiably be called “noise.” The top of Figure 1.1 
shows an accurate plot. 
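
The effect can be reproduced with a short Python sketch. It assumes IEEE double precision (Python's `float`) and samples 201 points in the interval [1.95, 2.05]; the names `horner`, `COEFFS`, and the sample grid are choices made for this illustration, not taken from the text.

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) by Horner's rule; coeffs[i] is a_i."""
    p = coeffs[-1]                      # p = a_d
    for a in reversed(coeffs[:-1]):     # i = d-1 down to 0
        p = x * p + a
    return p


# Coefficients a_0, ..., a_9 of the expanded form of (x - 2)^9.
COEFFS = [-512, 2304, -4608, 5376, -4032, 2016, -672, 144, -18, 1]

# Compare the expanded form (Horner) with the factored form near x = 2.
xs = [1.95 + 0.1 * k / 200 for k in range(201)]
max_true = max(abs((x - 2.0) ** 9) for x in xs)
max_abs_diff = max(abs(horner(COEFFS, x) - (x - 2.0) ** 9) for x in xs)
```

On this grid the true values $(x - 2)^9$ are all below about $2 \cdot 10^{-12}$ in magnitude, so a discrepancy `max_abs_diff` of comparable or larger size means the Horner values carry little or no correct information about the sign or size of $p(x)$ there.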

To understand the implications of this figure, let us see what would happen 
if we tried to find a zero of p(x) using a simple zero finder based on Bisection, 
shown below in Algorithm 1.1. 

Bisection starts with an interval $[x_{low}, x_{high}]$ in which p(x) changes sign
($p(x_{low}) \cdot p(x_{high}) < 0$) so that p(x) must have a zero in the interval. Then the
algorithm computes $p(x_{mid})$ at the interval midpoint $x_{mid} = (x_{low} + x_{high})/2$
and asks whether p(x) changes sign in the bottom half interval $[x_{low}, x_{mid}]$
or top half interval $[x_{mid}, x_{high}]$. Either way, we find an interval of half the
original length containing a zero of p(x). We can continue bisecting until the
interval is as short as desired.

So the decision between choosing the top half interval or bottom half
interval depends on the sign of $p(x_{mid})$. Examining the graph of p(x) in the bottom
half of Figure 1.1, we see that this sign varies rapidly from plus to minus as
x varies. So changing $x_{low}$ or $x_{high}$ just slightly could completely change the
sequence of sign decisions and also the final interval. Indeed, depending on the
initial choices of $x_{low}$ and $x_{high}$, the algorithm could converge anywhere inside
the “noisy region” from 1.95 to 2.05 (see Question 1.21).

To explain this fully, we return to properties of floating point arithmetic. 


Fig. 1.1. Plot of $y = (x - 2)^9 = x^9 - 18x^8 + 144x^7 - 672x^6 + 2016x^5 - 4032x^4 +
5376x^3 - 4608x^2 + 2304x - 512$ evaluated at 8000 equispaced points for x between
1.92 and 2.08, using $y = (x - 2)^9$ (top) and using Horner’s rule (bottom).


ALGORITHM 1.1. Finding zeros of p(x) using Bisection.

proc bisect(p, x_low, x_high, tol)
    /* find a root of p(x) = 0 in [x_low, x_high]
       assuming p(x_low) · p(x_high) < 0 */
    /* stop if zero found to within ±tol */
    p_low = p(x_low)
    p_high = p(x_high)
    while x_high − x_low > 2 · tol
        x_mid = (x_low + x_high)/2
        p_mid = p(x_mid)
        if p_low · p_mid < 0 then        /* there is a root in [x_low, x_mid] */
            x_high = x_mid
            p_high = p_mid
        else if p_mid · p_high < 0 then  /* there is a root in [x_mid, x_high] */
            x_low = x_mid
            p_low = p_mid
        else                             /* x_mid is a root */
            x_low = x_mid
            x_high = x_mid
        end if
    end while
    root = (x_low + x_high)/2
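
A direct Python transcription of Algorithm 1.1 might look as follows. It is a sketch that follows the pseudocode step by step; the test function $x^2 - 2$ and the tolerance are assumptions chosen because that function is well behaved near its root, unlike the expanded form of $(x - 2)^9$.

```python
def bisect(p, x_low, x_high, tol):
    """Find a root of p in [x_low, x_high], assuming p(x_low) * p(x_high) < 0.

    Stops when the root is bracketed to within +/- tol.
    """
    p_low = p(x_low)
    p_high = p(x_high)
    while x_high - x_low > 2 * tol:
        x_mid = (x_low + x_high) / 2
        p_mid = p(x_mid)
        if p_low * p_mid < 0:        # there is a root in [x_low, x_mid]
            x_high, p_high = x_mid, p_mid
        elif p_mid * p_high < 0:     # there is a root in [x_mid, x_high]
            x_low, p_low = x_mid, p_mid
        else:                        # x_mid is a root
            x_low = x_high = x_mid
    return (x_low + x_high) / 2


root = bisect(lambda x: x * x - 2.0, 0.0, 2.0, 1e-10)
```

The tolerance should stay well above the spacing of floating point numbers near the root; otherwise the interval can stop shrinking before the loop condition is met.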


1.5. Floating Point Arithmetic 


The number −3.1416 may be expressed in scientific notation as follows:

$$
-.31416 \times 10^{1},
$$

where − is the sign, .31416 is the fraction, 10 is the base, and 1 is the exponent.


Computers use a similar representation called floating point, but gener- 
ally the base is 2 (with exceptions, such as 16 for IBM 370 and 10 for some 
spreadsheets and most calculators). For example, $.10101_2 \times 2^3 = 5.25_{10}$.

A floating point number is called normalized if the leading digit of the 
fraction is nonzero. For example, $.10101_2 \times 2^3$ is normalized, but $.010101_2 \times 2^4$ is
not. Floating point numbers are usually normalized, which has two advantages: 
each nonzero floating point value has a unique representation as a bit string, 
and in binary the leading 1 in the fraction need not be stored explicitly (because 
it is always 1), leaving one extra bit for a longer, more accurate fraction. 
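
The normalized binary form of 5.25 can be checked in Python. The standard library function `math.frexp` happens to use the same convention as above, returning a fraction in [0.5, 1) and a power of 2; the loop that prints the fraction's bits is a small helper written for this sketch.

```python
import math

# 5.25 = fraction * 2**exponent with 0.5 <= fraction < 1 (normalized).
fraction, exponent = math.frexp(5.25)

# Extract the first five binary digits of the fraction after the binary point.
bits = ""
f = fraction
for _ in range(5):
    f *= 2
    bit = int(f)
    bits += str(bit)
    f -= bit
```

This gives `fraction == 0.65625`, `exponent == 3`, and `bits == "10101"`, matching $.10101_2 \times 2^3$.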

The most important parameters describing floating point numbers are the 
base; the number of digits (bits) in the fraction, which determines the precision; 
and the number of digits (bits) in the exponent, which determines the expo- 
nent range and thus the largest and smallest representable numbers. Different 
floating point arithmetics also differ in how they round computed results, what
they do about numbers that are too near zero (underflow) or too big
(overflow), whether ±∞ is allowed, and whether useful nonnumbers
(sometimes called NaNs, indefinites, or reserved operands) are provided. We
discuss each of these below.

First we consider the precision with which numbers can be represented. 
For example, $.31416 \times 10^1$ has five decimal digits, so any information less than
$.5 \times 10^{-4}$ may have been lost. This means that if $x$ is a real number whose
best five-digit approximation is $.31416 \times 10^1$, then the relative representation
error in $.31416 \times 10^1$ is

$$
\frac{|x - .31416 \times 10^1|}{.31416 \times 10^1} \le \frac{.5 \times 10^{-4}}{.31416 \times 10^1} \approx .16 \times 10^{-4}.
$$

The maximum relative representation error in a normalized number occurs for
$.10000 \times 10^1$, which is the most accurate five-digit approximation of all numbers
in the interval from .999995 to 1.00005. Its relative error is therefore bounded
by $.5 \cdot 10^{-4}$. More generally, the maximum relative representation error in a
floating point arithmetic with $p$ digits and base $\beta$ is $.5 \times \beta^{1-p}$. This is also half
the distance between 1 and the next larger floating point number, $1 + \beta^{1-p}$.

Computers have historically used many different choices of base, number 
of digits, and range, but fortunately the IEEE standard for binary arithmetic 
is now most common. It is used on SUN, DEC, HP, and IBM workstations 
and all PCs. IEEE arithmetic includes two kinds of floating point numbers: 
single precision (32 bits long) and double precision (64 bits long). 


IEEE single precision: sign (1 bit), exponent (8 bits), fraction (23 bits),
with the binary point immediately before the fraction.


If s, e, and f < 1 are the 1-bit sign, 8-bit exponent, and 23-bit fraction in
the IEEE single precision format, respectively, then the number represented is
$(-1)^s \cdot 2^{e-127} \cdot (1 + f)$. The maximum relative representation error is $2^{-24} \approx
6 \cdot 10^{-8}$, and the range of positive normalized numbers is from $2^{-126}$ (the
underflow threshold) to $2^{127} \cdot (2 - 2^{-23}) \approx 2^{128}$ (the overflow threshold), or
about $10^{-38}$ to $10^{38}$. The positions of these floating point numbers on the real
number line are shown in Figure 1.2 (where we use a 3-bit fraction for ease of
presentation).


Fig. 1.2. Real number line with floating point numbers indicated by solid tick marks.
The range shown is correct for IEEE single precision, but a 3-bit fraction is assumed
for ease of presentation so that there are only $2^3 - 1 = 7$ floating point numbers
between consecutive powers of 2, not $2^{23} - 1$. The distance between consecutive tick
marks is constant between powers of 2 and doubles/halves across powers of 2 (among
the normalized floating point numbers). $+2^{128}$ and $-2^{128}$, which are one unit in the
last place larger in magnitude than the overflow threshold (the largest finite floating
point number, $2^{127} \cdot (2 - 2^{-23})$), are shown as dotted tick marks. The figure is symmetric
about 0; +0 and −0 are distinct floating point bit strings but compare as numerically
equal. Division by zero is the only binary operation that gives different results, +∞
and −∞, for different signed zero arguments.
[Figure labels: normalized negative numbers, subnormal numbers around −0 and +0,
normalized positive numbers; underflow threshold $2^{-126}$; overflow threshold
$2^{127} \cdot (2 - 2^{-23})$.]


IEEE double precision: sign (1 bit), exponent (11 bits), fraction (52 bits),
with the binary point immediately before the fraction.


If s, e, and f < 1 are the 1-bit sign, 11-bit exponent, and 52-bit fraction
in IEEE double precision format, respectively, then the number represented is
$(-1)^s \cdot 2^{e-1023} \cdot (1 + f)$. The maximum relative representation error is $2^{-53} \approx
10^{-16}$, and the exponent range is $2^{-1022}$ (the underflow threshold) to $2^{1023} \cdot
(2 - 2^{-52}) \approx 2^{1024}$ (the overflow threshold), or about $10^{-308}$ to $10^{308}$.
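
The double precision layout can be inspected directly in Python, whose `float` is an IEEE double on essentially all current platforms (an assumption of this sketch). The helper names `decode_double` and `encode_normalized` are hypothetical; `struct` and `sys.float_info` are part of the standard library.

```python
import struct
import sys


def decode_double(x):
    """Split an IEEE double into sign bit s, biased exponent e, and fraction f < 1."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63
    e = (bits >> 52) & 0x7FF
    f = (bits & ((1 << 52) - 1)) / 2.0 ** 52
    return s, e, f


def encode_normalized(s, e, f):
    """Return (-1)^s * 2^(e-1023) * (1 + f) for a normalized number."""
    return (-1.0) ** s * 2.0 ** (e - 1023) * (1.0 + f)


s, e, f = decode_double(-5.25)            # -5.25 = -(1 + 0.3125) * 2**2
rebuilt = encode_normalized(s, e, f)

underflow_threshold = sys.float_info.min  # 2**-1022
overflow_threshold = sys.float_info.max   # 2**1023 * (2 - 2**-52)
max_relative_error = sys.float_info.epsilon / 2   # half of 2**-52, i.e. 2**-53
```

Note that `sys.float_info.epsilon` is the distance from 1 to the next larger floating point number, $\beta^{1-p} = 2^{-52}$, so the maximum relative representation error used in the text is half of it.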

When the true value of a computation $a \odot b$ (where $\odot$ is one of the four
binary operations +, −, *, and /) cannot be represented exactly as a floating
point number, it must be approximated by a nearby floating point number
before it can be stored in memory or a register. We denote this approximation
by $fl(a \odot b)$. The difference $(a \odot b) - fl(a \odot b)$ is called the roundoff error. If $fl(a \odot b)$
is a nearest floating point number to $a \odot b$, we say that the arithmetic rounds
correctly (or just rounds). IEEE arithmetic has this attractive property. (IEEE
arithmetic breaks ties, when $a \odot b$ is exactly halfway between two adjacent
floating point numbers, by choosing $fl(a \odot b)$ to have its least significant bit
zero; this is called rounding to nearest even.) When rounding correctly, if $a \odot b$
is within the exponent range (otherwise we get overflow or underflow), then
we can write

$$
fl(a \odot b) = (a \odot b)(1 + \delta), \qquad (1.1)
$$

where $|\delta|$ is bounded by $\varepsilon$, which is called variously machine epsilon, machine
precision, or macheps. Since we are rounding as accurately as possible, $\varepsilon$ is
equal to the maximum relative representation error $.5 \cdot \beta^{1-p}$. IEEE arithmetic
also guarantees that $fl(\sqrt{a}) = \sqrt{a}(1 + \delta)$, with $|\delta| \le \varepsilon$. This is the most
common model for roundoff error analysis and the one we will use in this
book. A nearly identical formula applies to complex floating point arithmetic;
see Question 1.12. However, formula (1.1) does ignore some interesting details.
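
Formula (1.1) can be checked experimentally in Python by comparing each floating point result with the exact rational result computed by the standard library `fractions.Fraction`. This sketch assumes IEEE double precision with round to nearest, so that $\varepsilon = 2^{-53}$; the sample operand pairs are arbitrary choices for illustration.

```python
import operator
from fractions import Fraction

EPS = Fraction(1, 2 ** 53)   # maximum relative representation error in double


def relative_rounding_error(a, b, op):
    """Return |fl(a op b) - (a op b)| / |a op b| as an exact rational number."""
    exact = op(Fraction(a), Fraction(b))   # exact result for the stored operands
    computed = Fraction(op(a, b))          # floating point result, converted exactly
    return abs(computed - exact) / abs(exact)


pairs = [(0.1, 0.2), (1.0, 3.0), (1e16, 1.0), (0.3, 0.7)]
ops = [operator.add, operator.sub, operator.mul, operator.truediv]
worst = max(relative_rounding_error(a, b, op) for a, b in pairs for op in ops)
```

For these operands `worst` does not exceed `EPS`, as the model predicts. The sum 0.1 + 0.2 is a case where the error is nonzero, because the exact sum of the two stored operands is not a floating point number.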


IEEE arithmetic also includes subnormal numbers, i.e., unnormalized
floating point numbers with the minimum possible exponent. These represent tiny
numbers between zero and the smallest normalized floating point number; see
Figure 1.2. Their presence means that a difference $fl(x - y)$ can never be zero
because of underflow, yielding the attractive property that the predicate $x = y$
is true if and only if $fl(x - y) = 0$. To incorporate errors caused by underflow
into formula (1.1) one would change it to

$$
fl(a \odot b) = (a \odot b)(1 + \delta) + \eta,
$$

where $|\delta| \le \varepsilon$ as before, and $|\eta|$ is bounded by a tiny number equal to the
largest error caused by underflow ($2^{-150} \approx 10^{-45}$ in IEEE single precision and
$2^{-1075} \approx 10^{-324}$ in IEEE double precision).
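
The role of subnormal numbers can be seen in a short Python experiment, again assuming that `float` is an IEEE double. The particular values of `x` and `y` are chosen for illustration so that their difference falls below the underflow threshold.

```python
import math
import sys

smallest_normal = sys.float_info.min          # 2**-1022, the underflow threshold
smallest_subnormal = math.ldexp(1.0, -1074)   # 2**-1074, the smallest positive double

# Two distinct normalized numbers whose difference is below the underflow threshold.
x = 1.5 * smallest_normal
y = smallest_normal
difference = x - y                            # 2**-1023, a subnormal number

# Halving the smallest subnormal is a tie between 0 and 2**-1074;
# round to nearest even gives 0, an underflow error of 2**-1075.
half_of_smallest = smallest_subnormal / 2
```

Because `difference` is representable as a subnormal number, `x - y` is nonzero even though it underflows the normalized range, consistent with the property that $fl(x - y) = 0$ only when $x = y$.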

IEEE arithmetic includes the symbols ±∞ and NaN (Not a Number). ±∞ is
returned when an operation overflows, and behaves according to the following
arithmetic rules: $x/\pm\infty = 0$ for any finite floating point number $x$, $x/0 = \pm\infty$
for any nonzero floating point number $x$, $+\infty + \infty = +\infty$, etc. An NaN is
returned by any operation with no well-defined finite or infinite result, such as
$\infty - \infty$, $\infty/\infty$, $0/0$, $\sqrt{-1}$, NaN $\odot\, x$, etc.
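
Several of these rules can be observed from Python, with one caveat that is a property of the language rather than of IEEE arithmetic: Python raises `ZeroDivisionError` for `1.0 / 0.0` and `ValueError` for `math.sqrt(-1.0)` instead of returning ±∞ or NaN, so those cases are omitted from this sketch.

```python
import math

inf = math.inf

overflowed = 1e308 * 10.0          # the product overflows to +inf

results = {
    "x / inf": 1.0 / inf,          # 0 for any finite x
    "inf + inf": inf + inf,        # +inf
    "inf - inf": inf - inf,        # NaN: no well-defined result
    "inf / inf": inf / inf,        # NaN
    "nan * 2": math.nan * 2.0,     # NaN propagates through arithmetic
}

# A NaN compares unequal to everything, including itself.
nan_equals_itself = results["inf - inf"] == results["inf - inf"]
```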

Whenever an arithmetic operation is invalid and so produces an NaN, or 
overflows or divides by zero to produce ±∞, or underflows, an exception flag is
set and can later be tested by the user’s program. These features permit one 
to write both more reliable programs (because the program can detect and 
correct its own exceptions, instead of simply aborting execution) and faster 
programs (by avoiding “paranoid” programming with many tests and branches 
to avoid possible but unlikely exceptions). For examples, see Question 1.19, 
the comments following Lemma 5.3, and [80]. 

The most expensive error known to have been caused by an improperly 
handled floating point exception is the crash of the Ariane 5 rocket of the 
European Space Agency on June 4, 1996. See HOME/ariane5rep.html for 
details. 

Not all machines use IEEE arithmetic or round carefully, although nearly 
all do. The most important modern exceptions are those machines produced 
by Cray Research,² although future generations of Cray machines may use
IEEE arithmetic. Since the difference between $fl(a \odot b)$ computed on a Cray
and $fl(a \odot b)$ computed on an IEEE machine usually lies in the 14th decimal
place or beyond, the reader may wonder whether the difference is important. 
Indeed, most algorithms in numerical linear algebra are insensitive to details 
in the way roundoff is handled. But it turns out that some algorithms are 
easier to design, or more reliable, when rounding is done properly. Here are 
two examples. 


²We include machines such as the NEC SX-4, which has a “Cray mode” in which it
performs arithmetic the same way. We exclude the Cray T3D and T3E, which are par- 
allel computers built from DEC Alpha processors, which use IEEE arithmetic very nearly 
(underflows are flushed to zero for speed’s sake). 

³Cray Research was purchased by Silicon Graphics in 1996.

When the Cray C90 subtracts 1 from the next smaller floating point
number, it gets $-2^{-47}$, which is twice the correct answer, $-2^{-48}$. Getting even
tiny differences to high relative accuracy is essential for the correctness of the 
divide-and-conquer algorithm for finding eigenvalues and eigenvectors of sym- 
metric matrices, currently the fastest algorithm available for the problem. This 
algorithm requires a rather nonintuitive modification to guarantee correctness 
on Cray machines (see section 5.3.3). 

The Cray may also yield an error when computing $\arccos(x/\sqrt{x^2 + y^2})$ 
because excessive roundoff causes the argument of arccos to be larger than 1. 
This cannot happen in IEEE arithmetic (see Question 1.17). 

To accommodate error analysis on a Cray C90 or other Cray machines we 
may instead use the model $\mathrm{fl}(a \pm b) = a(1 + \delta_1) \pm b(1 + \delta_2)$, $\mathrm{fl}(a * b) = (a * b)(1 + \delta_3)$, 
and $\mathrm{fl}(a/b) = (a/b)(1 + \delta_3)$, with $|\delta_i| \le \varepsilon$, where $\varepsilon$ is a small multiple of the 
maximum relative representation error. 

Briefly, we can say that correct rounding and other features of IEEE arith- 
metic are designed to preserve as many mathematical relationships used to 
derive formulas as possible. It is easier to design algorithms knowing that 
(barring over/underflow) $\mathrm{fl}(a - b)$ is computed with a small relative error (otherwise 
divide-and-conquer can fail), and that $-1 \le c = \mathrm{fl}(x/\sqrt{x^2 + y^2}) \le 1$ 
(otherwise arccos(c) can fail). There are many other such mathematical rela- 
tionships that one relies on (often unwittingly) to design algorithms. For more 
details about IEEE arithmetic and its relationship to numerical analysis, see 
[157, 156, 80]. 

Given the variability in floating point across machines, how does one write 
portable software that depends on the arithmetic? For example, iterative al- 
gorithms that we will study in later chapters frequently have loops such as 


    repeat
        update e
    until “e is negligible compared to f,”


where e > 0 is some error measure, and f > 0 is some comparison value (see 
section 4.4.5 for an example). By negligible we mean “is $e \le c \cdot \varepsilon \cdot f$?,” where 
$c \ge 1$ is some modest constant, chosen to trade off accuracy and speed of 
convergence. Since this test requires the machine-dependent constant $\varepsilon$, this test 
has in the past often been replaced by the apparently machine-independent 
test “is $e + f = f$?” The idea here is that adding e to f and rounding will 
yield f again if $e < \varepsilon f$ or perhaps a little smaller. But this test can fail 
(by requiring e to be much smaller than necessary, or than attainable), 
depending on the machine and compiler used (see the next paragraph). So the 
best test indeed uses $\varepsilon$ explicitly. It turns out that with sufficient care one 
can compute $\varepsilon$ in a machine-independent way, and software for this is 
available in the LAPACK subroutines slamch (for single precision) and dlamch 
(for double precision). These routines also compute or estimate the overflow 
threshold (without overflowing!), the underflow threshold, and other 
parameters. Another portable program that uses these explicit machine parameters 
is discussed in Question 1.19. 
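
The two stopping tests can be compared with a short Python sketch. It is only an illustration under stated assumptions: IEEE double precision with round to nearest, ε taken as 2^-53 (half of `sys.float_info.epsilon`), an arbitrary constant c = 4, and the hypothetical helper names `negligible` and `absorbed`, which are not LAPACK routines.

```python
import sys

# Maximum relative representation error for IEEE double with round to nearest.
# sys.float_info.epsilon is the gap between 1.0 and the next float (2**-52),
# so the unit roundoff used in the text is assumed to be half of it.
EPS = sys.float_info.epsilon / 2


def negligible(e, f, c=4.0):
    """Explicit test: is e <= c * eps * f?  (c >= 1 is a modest constant.)"""
    return e <= c * EPS * f


def absorbed(e, f):
    """Apparently machine-independent test: does e + f round back to f?"""
    return e + f == f


f = 1.0
e = 3.0 * EPS
print(negligible(e, f))  # the explicit test accepts e when c = 4
print(absorbed(e, f))    # e + f rounds to a neighbor of f, so this test rejects e
```

With c = 4 the explicit test stops the iteration while the `e + f == f` test would keep iterating, which is the "requiring e to be much smaller than necessary" failure mode described above.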

Sometimes one needs higher precision than is available from IEEE single 
or double precision. For example, higher precision is of use in algorithms such 
as iterative refinement for improving the accuracy of a computed solution of 
Ax = b (see section 2.5.1). So IEEE defines another, higher precision called 
double extended. For example, all arithmetic operations on an Intel Pentium 
(or its predecessors going back to the Intel 8086/8087) are performed in 80-bit 
double extended registers, providing 64-bit fractions and 15-bit exponents. Un- 
fortunately, not all languages and compilers permit one to declare and compute 
with double-extended precision variables. 

Few machines offer anything beyond double-extended arithmetic in hard- 
ware, but there are several ways in which more accurate arithmetic may be 
simulated in software. Some compilers on DEC Vax and DEC Alpha, SUN 
Sparc, and IBM RS6000 machines permit the user to declare quadruple preci- 
sion (or real*16 or double double precision) variables and to perform computa- 
tions with them. Since this arithmetic is simulated using shorter precision, it 
may run several times slower than double. Cray’s single precision is similar in 
precision to IEEE double, and so Cray double precision is about twice IEEE 
double; it too is simulated in software and runs relatively slowly. There are also 
algorithms and packages available for simulating much higher precision float- 
ing point arithmetic, using either integer arithmetic [20, 21] or the underlying 
floating point (see Question 1.18) [202, 216]. 

Finally, we mention interval arithmetic, a style of computation that 
automatically provides guaranteed error bounds. Each variable in an interval 
computation is represented by a pair of floating point numbers, one a lower 
bound and one an upper bound. Computation proceeds by rounding in such a 
way that lower bounds and upper bounds are propagated in a guaranteed 
fashion. For example, to add the intervals $a = [a_l, a_u]$ and $b = [b_l, b_u]$, one rounds 
$a_l + b_l$ down to the nearest floating point number, $c_l$, and rounds $a_u + b_u$ 
up to the nearest floating point number, $c_u$. This guarantees that the 
interval $c = [c_l, c_u]$ contains the sum of any pair of variables from a and from b. 
Unfortunately, if one naively takes a program and converts all floating point 
variables and operations to interval variables and operations, it is most likely 
that the intervals computed by the program will quickly grow so wide (such as 
$[-\infty, +\infty]$) that they provide no useful information at all. (A simple example 
is to repeatedly compute x = x − x when x is an interval; instead of getting 
x = 0, the width $x_u - x_l$ of x doubles at each subtraction.) It is possible to 
modify old algorithms or design new ones that do provide useful guaranteed 
error bounds [4, 138, 160, 188], but these are often several times as expensive 
as the algorithms discussed in this book. The error bounds that we present 
in this book are not guaranteed in the same mathematical sense that interval 
bounds are, but they are reliable enough in almost all situations. (We discuss 
this in more detail later.) We will not discuss interval arithmetic further in 
this book. 
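
The interval addition rule and the x = x − x example can be tried with a minimal Python sketch. It makes several assumptions: intervals are plain (lower, upper) tuples, the function names `iadd` and `isub` are hypothetical, and directed rounding is only approximated by stepping one floating point number outward with `math.nextafter` (Python 3.9 or later), which is slightly more pessimistic than true rounding down and up.

```python
import math


def iadd(a, b):
    """Add intervals a = (lo, hi) and b = (lo, hi) with outward rounding."""
    lo = math.nextafter(a[0] + b[0], -math.inf)  # push the lower bound down
    hi = math.nextafter(a[1] + b[1], math.inf)   # push the upper bound up
    return (lo, hi)


def isub(a, b):
    """Subtract interval b from interval a with outward rounding."""
    lo = math.nextafter(a[0] - b[1], -math.inf)
    hi = math.nextafter(a[1] - b[0], math.inf)
    return (lo, hi)


x = (1.0, 2.0)
widths = []
for _ in range(5):
    x = isub(x, x)              # for a number x this would be exactly 0
    widths.append(x[1] - x[0])
print(widths)                   # each width is at least twice the previous one
```

The growth happens because interval subtraction treats the two operands as unrelated, so the fact that both are the same x is lost.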


1.6. Polynomial Evaluation Revisited 


Let us now apply roundoff model (1.1) to evaluating a polynomial with Horner’s 
rule. We take the original program, 


    p = a_d
    for i = d − 1 down to 0
        p = x · p + a_i
    end for


Then we add subscripts to the intermediate results so that we have a unique 
symbol for each one ($p_0$ is the final result): 


    p_d = a_d
    for i = d − 1 down to 0
        p_i = x · p_{i+1} + a_i
    end for


Then we insert a roundoff term $(1 + \delta_i)$ at each floating point operation to get 


    p_d = a_d
    for i = d − 1 down to 0
        p_i = ((x · p_{i+1})(1 + δ_i) + a_i)(1 + δ'_i), where |δ_i|, |δ'_i| ≤ ε
    end for


Expanding, we get the following expression for the final computed value of the 
polynomial: 

$$p_0 = \sum_{i=0}^{d-1} \left[ (1 + \delta'_i) \prod_{j=0}^{i-1} (1 + \delta_j)(1 + \delta'_j) \right] a_i x^i + \left[ \prod_{j=0}^{d-1} (1 + \delta_j)(1 + \delta'_j) \right] a_d x^d .$$

This is messy, a typical result when we try to keep track of every rounding error 
in an algorithm. We simplify it using the following upper and lower bounds: 

$$(1 + \delta_1) \cdots (1 + \delta_j) \le (1 + \varepsilon)^j \le \frac{1}{1 - j\varepsilon} = 1 + j\varepsilon + O(\varepsilon^2),$$

$$(1 + \delta_1) \cdots (1 + \delta_j) \ge (1 - \varepsilon)^j \ge 1 - j\varepsilon.$$

These bounds are correct, provided that $j\varepsilon < 1$. Typically, we make the 
reasonable assumption that $j\varepsilon \ll 1$ ($j \ll 10^7$ in IEEE single precision) and 
make the approximations 

$$1 - j\varepsilon \le (1 + \delta_1) \cdots (1 + \delta_j) \le 1 + j\varepsilon.$$

This lets us write 

$$p_0 = \sum_{i=0}^{d} (1 + \bar\delta_i) a_i x^i = \sum_{i=0}^{d} \tilde a_i x^i, \quad \text{where } |\bar\delta_i| \le 2d\varepsilon \text{ and } \tilde a_i = (1 + \bar\delta_i) a_i .$$

So the computed value $p_0$ of p(x) is the exact value of a slightly different 
polynomial with coefficients $\tilde a_i$. This means that evaluating p(x) is “backward 
stable,” and the “backward error” is $2d\varepsilon$ measured as the maximum relative 
change of any coefficient of p(x). 

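Using this backward error bound, we bound the error in the computed 
polynomial: 

$$|p_0 - p(x)| = \left| \sum_{i=0}^{d} (1 + \bar\delta_i) a_i x^i - \sum_{i=0}^{d} a_i x^i \right| = \left| \sum_{i=0}^{d} \bar\delta_i a_i x^i \right| \le \sum_{i=0}^{d} \varepsilon \, 2d \, |a_i x^i| \le 2d\varepsilon \sum_{i=0}^{d} |a_i x^i| .$$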


Note that $\sum_i |a_i x^i|$ bounds the largest value that we could compute if there 
were no cancellation from adding positive and negative numbers, and the error 
bound is $2d\varepsilon$ times smaller. This is also the case for computing dot products 
and many other polynomial-like expressions. 

By choosing $\bar\delta_i = \varepsilon \cdot \mathrm{sign}(a_i x^i)$, we see that the error bound is attainable to 
within the modest factor 2d. This means that we may use 

$$\frac{\sum_{i=0}^{d} |a_i x^i|}{\left| \sum_{i=0}^{d} a_i x^i \right|}$$

as the relative condition number for polynomial evaluation. 


We can easily compute this error bound, at the cost of doubling the number 
of operations: 


    p = a_d, bp = |a_d|
    for i = d − 1 down to 0
        p = x · p + a_i
        bp = |x| · bp + |a_i|
    end for
    error bound = bp = 2d · ε · bp


so the true value of the polynomial is in the interval [p − bp, p + bp], and the 
number of guaranteed correct decimal digits is $-\log_{10}(|bp/p|)$. These bounds are 
plotted in the top of Figure 1.3 for the polynomial discussed earlier, $(x - 2)^9$. 
(The reader may wonder whether roundoff errors could make this computed 
error bound inaccurate. This turns out not to be a problem and is left to the 
reader as an exercise.) 
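
The running error bound can be sketched in Python as follows. This is an illustration under assumptions, not the program used for Figure 1.3: ε is taken as 2^-53 for IEEE double, the function name `horner_with_bound` is hypothetical, coefficients are stored in increasing powers of x, and exact rational arithmetic from the standard `fractions` module serves only as a reference for the true error.

```python
import sys
from fractions import Fraction
from math import comb

EPS = sys.float_info.epsilon / 2  # unit roundoff 2**-53, assuming IEEE double


def horner_with_bound(a, x):
    """Evaluate sum(a[i] * x**i) by Horner's rule; also return 2*d*eps*bp."""
    d = len(a) - 1
    p = a[d]
    bp = abs(a[d])
    for i in range(d - 1, -1, -1):
        p = x * p + a[i]
        bp = abs(x) * bp + abs(a[i])
    return p, 2 * d * EPS * bp


# Coefficients of (x - 2)**9 in increasing powers of x (exact small integers).
a = [float(comb(9, i) * (-2) ** (9 - i)) for i in range(10)]

x = 2.01
p, bound = horner_with_bound(a, x)
exact = (Fraction(x) - 2) ** 9      # exact value at the stored float x
err = abs(Fraction(p) - exact)      # true error of the computed value
print(p, bound, float(err))
```

Near x = 2 the bound is much larger than |p(x)|, so the interval [p − bp, p + bp] guarantees few or no correct digits, in line with the discussion of Figure 1.3.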

The graph of $-\log_{10}|bp/p|$ in the bottom of Figure 1.3, a lower bound on 
the number of correct decimal digits, indicates that we expect difficulty 
computing p(x) to high relative accuracy when p(x) is near 0. What is special 
about p(x) = 0? An arbitrarily small error $\varepsilon$ in computing p(x) = 0 causes 
an infinite relative error $\varepsilon/p(x) = \varepsilon/0$. In other words, our relative error bound 
$2d\varepsilon \sum_{i=0}^{d} |a_i x^i| / |\sum_{i=0}^{d} a_i x^i|$ is infinite. 


DEFINITION 1.1. A problem whose condition number is infinite is called 
ill-posed. Otherwise it is called well-posed.⁴ 


There is a simple geometric interpretation of the condition number: it tells 
us how far p(x) is from a polynomial which is ill-posed. 


DEFINITION 1.2. Let $p(z) = \sum_{i=0}^{d} a_i z^i$ and $q(z) = \sum_{i=0}^{d} b_i z^i$. Define the 
relative distance d(p,q) from p to q as the smallest value satisfying 
$|a_i - b_i| \le d(p,q) \cdot |a_i|$ for $0 \le i \le d$. (If all $a_i \ne 0$, then we can more simply write 
$d(p,q) = \max_{0 \le i \le d} \left| \frac{a_i - b_i}{a_i} \right|$.) 


Note that if $a_i = 0$, then $b_i$ must also be zero for d(p,q) to be finite. 


THEOREM 1.2. Suppose that $p(z) = \sum_{i=0}^{d} a_i z^i$ is not identically zero. Then 

$$\min\{ d(p,q) \text{ such that } q(x) = 0 \} = \frac{\left| \sum_{i=0}^{d} a_i x^i \right|}{\sum_{i=0}^{d} |a_i x^i|} .$$


In other words, the distance from p to the nearest polynomial q whose condition 
number at x is infinite is the reciprocal of the condition number of p(x). 


Proof. Write $q(z) = \sum b_i z^i = \sum (1 + \epsilon_i) a_i z^i$ so that $d(p,q) = \max_i |\epsilon_i|$. Then 
$q(x) = 0$ implies $|p(x)| = |q(x) - p(x)| = |\sum_{i=0}^{d} \epsilon_i a_i x^i| \le \sum_{i=0}^{d} |\epsilon_i a_i x^i| \le 
\max_i |\epsilon_i| \sum_i |a_i x^i|$, which in turn implies $d(p,q) = \max_i |\epsilon_i| \ge |p(x)| / \sum_i |a_i x^i|$. 
To see that there is a q this close to p, choose 

$$\epsilon_i = \frac{-p(x)}{\sum_j |a_j x^j|} \cdot \mathrm{sign}(a_i x^i).$$


⁴This definition is slightly nonstandard, because ill-posed problems include those whose 
solutions are continuous as long as they are nondifferentiable. Examples include multiple 
roots of polynomials and multiple eigenvalues of matrices (section 4.3). Another way to 
describe an ill-posed problem is one in which the number of correct digits in the solution is 
not always within a constant of the number of digits used in the arithmetic in the solution. 
For example, multiple roots of polynomials tend to lose half or more of the precision of the 
arithmetic. 


Fig. 1.3. Plot of error bounds on the value of $y = (x - 2)^9$ evaluated using Horner’s 
rule. 


This simple reciprocal relationship between condition number and distance 
to the nearest ill-posed problem is very common in numerical analysis, and we 
shall encounter it again later. 
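
The construction in the proof of Theorem 1.2 can be checked numerically with a small Python sketch. It assumes exact rational arithmetic from the standard `fractions` module so that q(x) = 0 holds exactly, assumes that the sum of |a_i x^i| is nonzero, and uses the hypothetical function name `nearest_ill_posed`; the example polynomial z^2 − 2 at x = 1 is chosen only for illustration.

```python
from fractions import Fraction


def nearest_ill_posed(a, x):
    """Return (b, dist): coefficients of q with q(x) = 0 and d(p, q) = dist."""
    terms = [Fraction(ai) * Fraction(x) ** i for i, ai in enumerate(a)]
    p = sum(terms)                      # p(x), exact
    s = sum(abs(t) for t in terms)      # sum_i |a_i x^i|, assumed nonzero
    sign = lambda t: (t > 0) - (t < 0)
    # b_i = (1 + eps_i) a_i with eps_i = -p(x)/s * sign(a_i x^i)
    b = [Fraction(ai) * (1 - p / s * sign(t)) for ai, t in zip(a, terms)]
    return b, abs(p) / s


a = [-2, 0, 1]                          # p(z) = z^2 - 2, increasing powers
b, dist = nearest_ill_posed(a, 1)
q_at_x = sum(bi * Fraction(1) ** i for i, bi in enumerate(b))
print(b, dist, q_at_x)
```

Here the relative condition number of p at x = 1 is 3, and the perturbed polynomial q differs from p by a relative change of 1/3 in each nonzero coefficient.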

At the beginning of the introduction we said that we would use canonical 
forms of matrices to help solve linear algebra problems. For example, knowing 
the exact Jordan canonical form makes computing exact eigenvalues trivial. 
There is an analogous canonical form for polynomials, which makes accurate 
polynomial evaluation easy: $p(x) = a_d \prod_{i=1}^{d} (x - r_i)$. In other words, we 
represent the polynomial by its leading coefficient $a_d$ and its roots $r_1, \ldots, r_d$. To 
evaluate p(x) we use the obvious algorithm 


    p = a_d
    for i = 1 to d
        p = p · (x − r_i)
    end for


It is easy to show the computed $p = p(x) \cdot (1 + \delta)$, where $|\delta| \le 2d\varepsilon$; i.e., we 
always get p(x) with high relative accuracy. But we need the roots of the 
polynomial to do this! 
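
To make the contrast concrete, here is a Python sketch that evaluates $(x - 2)^9$ both from its roots and from its expanded coefficients. The function names `eval_from_roots` and `horner` are hypothetical, and the evaluation point 2 + 2^-10 is chosen (as an assumption for the demonstration) because it and all intermediate products in the root form are exactly representable in IEEE double.

```python
def eval_from_roots(a_d, roots, x):
    """Evaluate a_d * prod(x - r_i) by the obvious product loop."""
    p = a_d
    for r in roots:
        p = p * (x - r)
    return p


def horner(a, x):
    """Evaluate sum(a[i] * x**i) by Horner's rule (a in increasing powers)."""
    p = a[-1]
    for ai in reversed(a[:-1]):
        p = x * p + ai
    return p


roots = [2.0] * 9
# Expanded coefficients of (x - 2)**9 in increasing powers of x.
a = [-512.0, 2304.0, -4608.0, 5376.0, -4032.0,
     2016.0, -672.0, 144.0, -18.0, 1.0]

x = 2.0 + 2.0 ** -10
print(eval_from_roots(1.0, roots, x))   # the product form
print(horner(a, x))                     # Horner on the expanded coefficients
print(2.0 ** -90)                       # the true value at this x
```

The product form reproduces the tiny true value, while Horner's rule is only guaranteed an absolute error of about $2d\varepsilon \sum_i |a_i x^i|$, which is far larger than the value itself near the root.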


1.7. Vector and Matrix Norms 


Norms are used to measure errors in matrix computations, so we need to 
understand how to compute and manipulate them. 
Missing proofs are left as problems at the end of the chapter. 


DEFINITION 1.3. Let B be a real (complex) linear space $\mathbb{R}^n$ (or $\mathbb{C}^n$). It is 
normed if there is a function $\|\cdot\| : B \to \mathbb{R}$, which we call a norm, satisfying 
all of the following: 


1) $\|x\| \ge 0$, and $\|x\| = 0$ if and only if $x = 0$ (positive definiteness), 
2) $\|\alpha x\| = |\alpha| \cdot \|x\|$ for any real (or complex) scalar $\alpha$ (homogeneity), 
3) $\|x + y\| \le \|x\| + \|y\|$ (the triangle inequality). 


EXAMPLE 1.4. The most common norms are $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$ for $1 \le p < \infty$, 
which we call p-norms, as well as $\|x\|_\infty = \max_i |x_i|$, which we call the 
∞-norm or infinity-norm. Also, if $\|x\|$ is any norm and C is any nonsingular 
matrix, then $\|Cx\|$ is also a norm. ⋄ 


We see that there are many norms that we could use to measure errors; it 
is important to choose an appropriate one. For example, let $x_1 = [1, 2, 3]^T$ in 
meters and $x_2 = [1.01, 2.01, 2.99]^T$ in meters. Then $x_2$ is a good approximation 
to $x_1$ because the relative error $\|x_1 - x_2\|_\infty / \|x_1\|_\infty \approx .0033$, and $x_3 = [10, 2.01, 2.99]^T$ is 
a bad approximation because $\|x_1 - x_3\|_\infty / \|x_1\|_\infty = 3$. But suppose the first component 
is measured in kilometers instead of meters. Then in this norm $\hat x_1$ and $\hat x_3$ look 
close: 

$$\hat x_1 = \begin{bmatrix} .001 \\ 2 \\ 3 \end{bmatrix}, \quad \hat x_3 = \begin{bmatrix} .01 \\ 2.01 \\ 2.99 \end{bmatrix}, \quad \text{and} \quad \frac{\|\hat x_1 - \hat x_3\|_\infty}{\|\hat x_1\|_\infty} \approx .0033.$$

To compare $\hat x_1$ and $\hat x_3$, we should use 

$$\|\hat x\|_s \equiv \left\| \begin{bmatrix} 1000 & & \\ & 1 & \\ & & 1 \end{bmatrix} \hat x \right\|_\infty$$

to make the units the same or so that equally important errors make the norm 
equally large. 
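
The numbers in this example can be reproduced with a few lines of Python. The helper names `norm_inf` and `rel_err` and the optional `scale` argument (the diagonal of the scaling matrix) are hypothetical, introduced only for this sketch.

```python
def norm_inf(v):
    """Infinity norm of a vector given as a list of floats."""
    return max(abs(t) for t in v)


def rel_err(x, y, scale=None):
    """Relative error ||D(x - y)||_inf / ||D x||_inf with D = diag(scale)."""
    if scale is None:
        scale = [1.0] * len(x)
    diff = [s * (xi - yi) for s, xi, yi in zip(scale, x, y)]
    ref = [s * xi for s, xi in zip(scale, x)]
    return norm_inf(diff) / norm_inf(ref)


x1 = [1.0, 2.0, 3.0]                 # all components in meters
x2 = [1.01, 2.01, 2.99]
x3 = [10.0, 2.01, 2.99]
x1_hat = [0.001, 2.0, 3.0]           # first component in kilometers
x3_hat = [0.01, 2.01, 2.99]

print(rel_err(x1, x2))                              # small: good approximation
print(rel_err(x1, x3))                              # large: bad approximation
print(rel_err(x1_hat, x3_hat))                      # misleadingly small
print(rel_err(x1_hat, x3_hat, [1000.0, 1.0, 1.0]))  # scaled norm shows the error
```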

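Now we define inner products, which are a generalization of the standard 
dot product $\sum_i x_i y_i$, and arise frequently in linear algebra. 


DEFINITION 1.4. Let B be a real (complex) linear space. $\langle \cdot, \cdot \rangle : B \times B \to \mathbb{R}$ ($\mathbb{C}$) 
is an inner product if all of the following apply: 


1) $\langle x, y \rangle = \langle y, x \rangle$ (or $\overline{\langle y, x \rangle}$), 
2) $\langle x, y + z \rangle = \langle x, y \rangle + \langle x, z \rangle$, 
3) $\langle \alpha x, y \rangle = \alpha \langle x, y \rangle$ for any real (or complex) scalar $\alpha$, 
4) $\langle x, x \rangle \ge 0$, and $\langle x, x \rangle = 0$ if and only if $x = 0$. 


EXAMPLE 1.5. Over $\mathbb{R}$, $\langle x, y \rangle = y^T x = \sum_i x_i y_i$, and over $\mathbb{C}$, $\langle x, y \rangle = y^* x = 
\sum_i x_i \bar y_i$ are inner products. (Recall that $y^* = \bar y^T$ is the conjugate transpose of 
y.) ⋄ 


DEFINITION 1.5. x and y are orthogonal if $\langle x, y \rangle = 0$. 


The most important property of an inner product is that it satisfies the 
Cauchy-Schwarz inequality. This can be used in turn to show that $\sqrt{\langle x, x \rangle}$ is 
a norm, one that we will frequently use. 


LEMMA 1.1. Cauchy-Schwarz inequality. $|\langle x, y \rangle| \le \sqrt{\langle x, x \rangle \cdot \langle y, y \rangle}$. 


LEMMA 1.2. $\sqrt{\langle x, x \rangle}$ is a norm. 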


There is a one-to-one correspondence between inner-products and symmet- 
ric (Hermitian) positive definite matrices, as defined below. These matrices 
arise frequently in applications. 


DEFINITION 1.6. A real symmetric (complex Hermitian) matrix A is positive 
definite if $x^T A x > 0$ ($x^* A x > 0$) for all $x \ne 0$. We abbreviate symmetric 
positive definite to s.p.d., and Hermitian positive definite to h.p.d. 


LEMMA 1.3. Let $B = \mathbb{R}^n$ (or $\mathbb{C}^n$) and $\langle \cdot, \cdot \rangle$ be an inner product. Then there 
is an n-by-n s.p.d. (h.p.d.) matrix A such that $\langle x, y \rangle = y^T A x$ ($y^* A x$). 
Conversely, if A is s.p.d. (h.p.d.), then $y^T A x$ ($y^* A x$) is an inner product. 


The following two lemmas are useful in converting error bounds in terms 
of one norm to error bounds in terms of another. 


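LEMMA 1.4. Let $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ be two norms on $\mathbb{R}^n$ (or $\mathbb{C}^n$). There are 
constants $c_1, c_2 > 0$ such that, for all x, $c_1 \|x\|_\alpha \le \|x\|_\beta \le c_2 \|x\|_\alpha$. We also 
say that norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ are equivalent with respect to constants $c_1$ and 
$c_2$. 


LEMMA 1.5. 

$$\|x\|_2 \le \|x\|_1 \le \sqrt{n} \, \|x\|_2,$$
$$\|x\|_\infty \le \|x\|_2 \le \sqrt{n} \, \|x\|_\infty,$$
$$\|x\|_\infty \le \|x\|_1 \le n \, \|x\|_\infty.$$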

In addition to vector norms, we will also need matrix norms to measure 
errors in matrices. 


DEFINITION 1.7. $\|\cdot\|$ is a matrix norm on m-by-n matrices if it is a vector 
norm on m·n dimensional space: 


1) $\|A\| \ge 0$ and $\|A\| = 0$ if and only if $A = 0$, 
2) $\|\alpha A\| = |\alpha| \cdot \|A\|$, 
3) $\|A + B\| \le \|A\| + \|B\|$. 


EXAMPLE 1.6. $\max_{ij} |a_{ij}|$ is called the max norm, and $(\sum_{ij} |a_{ij}|^2)^{1/2} = \|A\|_F$ 
is called the Frobenius norm. ⋄ 


The following definition is useful for bounding the norm of a product of 
matrices, something we often need to do when deriving error bounds. 


DEFINITION 1.8. Let $\|\cdot\|_{m \times n}$ be a matrix norm on m-by-n matrices, $\|\cdot\|_{n \times p}$ 
be a matrix norm on n-by-p matrices, and $\|\cdot\|_{m \times p}$ be a matrix norm on 
m-by-p matrices. These norms are called mutually consistent if $\|A \cdot B\|_{m \times p} \le 
\|A\|_{m \times n} \cdot \|B\|_{n \times p}$, where A is m-by-n and B is n-by-p. 


DEFINITION 1.9. Let A be m-by-n, $\|\cdot\|_{\hat m}$ be a vector norm on $\mathbb{R}^m$, and $\|\cdot\|_{\hat n}$ 
be a vector norm on $\mathbb{R}^n$. Then 

$$\|A\|_{\hat m \hat n} \equiv \max_{x \ne 0, \; x \in \mathbb{R}^n} \frac{\|Ax\|_{\hat m}}{\|x\|_{\hat n}}$$

is called an operator norm or induced norm or subordinate matrix norm. 


The next lemma provides a large source of matrix norms, ones that we will 
use for bounding errors. 


LEMMA 1.6. An operator norm is a matrix norm. 


Orthogonal and unitary matrices, defined next, are essential ingredients of 
nearly all our algorithms for least squares problems and eigenvalue problems. 


DEFINITION 1.10. A real square matrix Q is orthogonal if $Q^{-1} = Q^T$. A 
complex square matrix is unitary if $Q^{-1} = Q^*$. 


All rows (or columns) of orthogonal (or unitary) matrices have unit 2-norms 
and are orthogonal to one another, since $QQ^T = Q^TQ = I$ ($QQ^* = Q^*Q = I$). 

The next lemma summarizes the essential properties of the norms and 
matrices we have introduced so far. We will use these properties later in the 
book. 


LEMMA 1.7. 

1. $\|Ax\| \le \|A\| \cdot \|x\|$ for a vector norm and its corresponding 
operator norm, or the vector two-norm and matrix Frobenius norm. 


2. $\|AB\| \le \|A\| \cdot \|B\|$ for any operator norm or for the Frobenius norm. 
In other words, any operator norm (or the Frobenius norm) is mutually 
consistent with itself. 


3. The max norm and Frobenius norm are not operator norms. 


4. $\|QAZ\| = \|A\|$ if Q and Z are orthogonal or unitary for the Frobenius 
norm and for the operator norm induced by $\|\cdot\|_2$. This is really just the 
Pythagorean theorem. 


5. $\|A\|_\infty \equiv \max_{x \ne 0} \frac{\|Ax\|_\infty}{\|x\|_\infty} = \max_i \sum_j |a_{ij}|$ = maximum absolute row sum. 


6. $\|A\|_1 \equiv \max_{x \ne 0} \frac{\|Ax\|_1}{\|x\|_1} = \|A^T\|_\infty = \max_j \sum_i |a_{ij}|$ = maximum absolute 
column sum. 


7. $\|A\|_2 \equiv \max_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} = \sqrt{\lambda_{\max}(A^*A)}$, where $\lambda_{\max}$ denotes the largest 
eigenvalue. 


8. $\|A\|_2 = \|A^T\|_2$. 


9. $\|A\|_2 = \max_i |\lambda_i(A)|$ if A is normal, i.e., $AA^* = A^*A$. 


10. If A is n-by-n, then $n^{-1/2} \|A\|_2 \le \|A\|_1 \le n^{1/2} \|A\|_2$. 


11. If A is n-by-n, then $n^{-1/2} \|A\|_2 \le \|A\|_\infty \le n^{1/2} \|A\|_2$. 


12. If A is n-by-n, then $n^{-1} \|A\|_\infty \le \|A\|_1 \le n \|A\|_\infty$. 


13. If A is n-by-n, then $\|A\|_2 \le \|A\|_F \le n^{1/2} \|A\|_2$. 


Proof. We prove part 7 only and leave the rest to the reader. 

Since $A^*A$ is Hermitian, there exists an eigendecomposition $A^*A = Q \Lambda Q^*$, 
with Q a unitary matrix (the columns are eigenvectors), and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, 
a diagonal matrix containing the eigenvalues, which must all be real. 
Note that all $\lambda_i \ge 0$ since if one, say $\lambda$, were negative, we would take q as 
its eigenvector and get the contradiction $0 \le \|Aq\|_2^2 = q^* A^* A q = q^* \lambda q = 
\lambda \|q\|_2^2 < 0$. Therefore 

$$\begin{aligned}
\|A\|_2 &= \max_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} = \max_{x \ne 0} \frac{(x^* A^* A x)^{1/2}}{\|x\|_2} = \max_{x \ne 0} \frac{(x^* Q \Lambda Q^* x)^{1/2}}{\|x\|_2} \\
&= \max_{x \ne 0} \frac{((Q^* x)^* \Lambda (Q^* x))^{1/2}}{\|Q^* x\|_2} = \max_{y \ne 0} \frac{(y^* \Lambda y)^{1/2}}{\|y\|_2} = \max_{y \ne 0} \sqrt{\frac{\sum_i \lambda_i |y_i|^2}{\sum_i |y_i|^2}} \\
&\le \max_{y \ne 0} \sqrt{\lambda_{\max}} \sqrt{\frac{\sum_i |y_i|^2}{\sum_i |y_i|^2}} = \sqrt{\lambda_{\max}},
\end{aligned}$$

which is attainable by choosing y to be the appropriate column of the identity 
matrix. 

1.8. References and Other Topics for Chapter 1 


At the end of each chapter we will list the references most relevant to that 
chapter. They are also listed alphabetically in the bibliography at the end. In 
addition we will give pointers to related topics not discussed in the main text. 

The most modern comprehensive work in this area is by G. Golub and C. 
Van Loan [119], which also has an extensive bibliography. A recent undergrad- 
uate level or beginning graduate text in this material is by D. Watkins [250]. 
Another good graduate text is by L. Trefethen and D. Bau [241]. A classic 
work that is somewhat dated but still an excellent reference is by J. Wilkinson 
[260]. An older but still excellent book at the same level as Watkins is by G. 
Stewart [233]. 

More detailed information on error analysis can be found in the recent book 
by N. Higham [147]. Older but still good general references are by J. Wilkinson 
[259] and W. Kahan [155]. 

“What every computer scientist should know about floating point arith- 
metic” by D. Goldberg is a good recent survey [117]. IEEE arithmetic is de- 
scribed formally in [11, 12, 157] as well as in the reference manuals published 
by computer manufacturers. Discussion of error analysis with IEEE arithmetic 
may be found in [53, 69, 157, 156] and the references cited therein. 

A more general discussion of condition numbers and the distance to the 
nearest ill-posed problem is given by the author in [70] as well as in a series 
of papers by S. Smale and M. Shub [217, 218, 219, 220]. Vector and matrix 
norms are discussed at length in [119, sects. 2.2, 2.3]. 


1.9. Questions for Chapter 1 


QUESTION 1.1. (Easy; Z. Bai) Let A be an orthogonal matrix. Show that 
$\det(A) = \pm 1$. Show that if B also is orthogonal and $\det(A) = -\det(B)$, then 
A + B is singular. 


QUESTION 1.2. (Easy; Z. Bai) The rank of a matrix is the dimension of the 
space spanned by its columns. Show that A has rank one if and only if $A = ab^T$ 
for some column vectors a and b. 


QUESTION 1.3. (Easy; Z. Bai) Show that if a matrix is orthogonal and trian- 
gular, then it is diagonal. What are its diagonal elements? 


QUESTION 1.4. (Easy; Z. Bai) A matrix is strictly upper triangular if it is 
upper triangular with zero diagonal elements. Show that if A is strictly upper 
triangular and n-by-n, then $A^n = 0$. 


QUESTION 1.5. (Easy; Z. Bai) Let $\|\cdot\|$ be a vector norm on $\mathbb{R}^m$ and assume 
that $C \in \mathbb{R}^{m \times n}$. Show that if rank(C) = n, then $\|x\|_C \equiv \|Cx\|$ is a vector 
norm. 


QUESTION 1.6. (Easy; Z. Bai) Show that if $0 \ne s \in \mathbb{R}^n$ and $E \in \mathbb{R}^{n \times n}$, then 

$$\left\| E \left( I - \frac{s s^T}{s^T s} \right) \right\|_F^2 = \|E\|_F^2 - \frac{\|Es\|_2^2}{s^T s} .$$


QUESTION 1.7. (Easy; Z. Bai) Verify that $\|xy^*\|_F = \|xy^*\|_2 = \|x\|_2 \|y\|_2$ for 
any $x, y \in \mathbb{C}^n$. 


QUESTION 1.8. (Medium) One can identify the degree d polynomials $p(x) = 
\sum_{i=0}^{d} a_i x^i$ with $\mathbb{R}^{d+1}$ via the vector of coefficients. Let x be fixed. Let $S_x$ be 
the set of polynomials with an infinite relative condition number with respect 
to evaluating them at x (i.e., they are zero at x). In a few words, describe $S_x$ 
geometrically as a subset of $\mathbb{R}^{d+1}$. Let $S_x(\kappa)$ be the set of polynomials whose 
relative condition number is $\kappa$ or greater. Describe $S_x(\kappa)$ geometrically in a 
few words. Describe how $S_x(\kappa)$ changes geometrically as $\kappa \to \infty$. 


QUESTION 1.9. (Medium; from the 1995 final exam) Consider the figure 
below. It plots the function $y = \log(1 + x)/x$ computed in two different ways. 
Mathematically, y is a smooth function of x near x = 0, equaling 1 at 0. But 
if we compute y using this formula, we get the plots on the left (shown in the 
ranges $x \in [-1, 1]$ on the top left and $x \in [-10^{-15}, 10^{-15}]$ on the bottom left). 
This formula is clearly unstable near x = 0. On the other hand, if we use the 
algorithm 


    d = 1 + x
    if d = 1 then
        y = 1
    else
        y = log(d)/(d − 1)
    end if


we get the two plots on the right, which are correct near x = 0. Explain this 
phenomenon, proving that the second algorithm must compute an accurate 
answer in floating point arithmetic. Assume that the log function returns an 
accurate answer for any argument. (This is true of any reasonable implemen- 
tation of logarithm.) Assume IEEE floating point arithmetic if that makes 
your argument easier. (Both algorithms can malfunction on a Cray machine.) 
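
For readers who want to reproduce the behavior before attempting the proof, the two formulas can be run side by side in Python. The function names `naive` and `stable` are hypothetical, IEEE double precision is assumed, and `math.log1p` is used only as an independent reference value; the sketch shows the phenomenon and does not explain it.

```python
import math


def naive(x):
    """Direct formula log(1 + x)/x (undefined at x = 0)."""
    return math.log(1.0 + x) / x


def stable(x):
    """Second algorithm: divide by the computed (1 + x) - 1 instead of x."""
    d = 1.0 + x
    if d == 1.0:
        return 1.0
    return math.log(d) / (d - 1.0)


x = 1e-15
reference = math.log1p(x) / x      # accurate reference for comparison
print(naive(x), stable(x), reference)
```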


[Figure: four plots of y against x. Left column: y = log(1+x)/x. Right column: 
y = log(1+x)/[(1+x)−1]. Top row: x from −1 to 1. Bottom row: x from −10^{−15} to 
10^{−15}. The y-axis in each plot runs from −1 to 3.] 

QUESTION 1.10. (Medium) Show that, barring overflow or underflow, 
$\mathrm{fl}(\sum_{i=1}^{d} x_i y_i) = \sum_{i=1}^{d} x_i y_i (1 + \delta_i)$, where $|\delta_i| \le d\varepsilon$. Use this to prove the 
following fact. Let $A^{m \times n}$ and $B^{n \times p}$ be matrices, and compute their product 
in the usual way. Barring overflow or underflow show that $|\mathrm{fl}(A \cdot B) - A \cdot B| \le 
n \cdot \varepsilon \cdot |A| \cdot |B|$. Here the absolute value of a matrix |A| means the matrix with 
entries $(|A|)_{ij} = |a_{ij}|$, and the inequality is meant componentwise. 

The result of this question will be used in section 2.4.2, where we analyze 
the roundoff errors in Gaussian elimination. 


QUESTION 1.11. (Medium) Let L be a lower triangular matrix and solve $Lx = 
b$ by forward substitution. Show that barring overflow or underflow, the 
computed solution $\hat x$ satisfies $(L + \delta L)\hat x = b$, where $|\delta l_{ij}| \le n\varepsilon |l_{ij}|$, where $\varepsilon$ is the 
machine precision. This means that forward substitution is backward stable. 
Argue that backward substitution for solving upper triangular systems satisfies 
the same bound. 

The result of this question will be used in section 2.4.2, where we analyze 
the roundoff errors in Gaussian elimination. 


QUESTION 1.12. (Medium) In order to analyze the effects of rounding errors, 
we have used the following model (see equation (1.1)): 


    $\mathrm{fl}(a \odot b) = (a \odot b)(1 + \delta),$


where $\odot$ is one of the four basic operations +, −, ∗, and /, and $|\delta| \le \varepsilon$. To show 
that our analyses also work for complex data, we need to prove an analogous 
formula for the four basic complex operations. Now $\delta$ will be a tiny complex 
number bounded in absolute value by a small multiple of $\varepsilon$. Prove that this 
is true for complex addition, subtraction, multiplication, and division. Your 
algorithm for complex division should successfully compute $a/a \approx 1$, where 
|a| is either very large (larger than the square root of the overflow threshold) 
or very small (smaller than the square root of the underflow threshold). Is it 
true that both the real and imaginary parts of the complex product are always 
computed to high relative accuracy? 


QUESTION 1.13. (Medium) Prove Lemma 1.3. 
QUESTION 1.14. (Medium) Prove Lemma 1.5. 


QUESTION 1.15. (Medium) Prove Lemma 1.6. 


QUESTION 1.16. (Medium) Prove all parts except 7 of Lemma 1.7. Hint for 
part 8: Use the fact that if X and Y are both n-by-n, then XY and YX have 
the same eigenvalues. Hint for part 9: Use the fact that a matrix is normal if 
and only if it has a complete set of orthonormal eigenvectors. 


QUESTION 1.17. (Hard; W. Kahan) We mentioned that on a Cray machine 
the expression $\arccos(x/\sqrt{x^2 + y^2})$ caused an error, because roundoff caused 
$x/\sqrt{x^2 + y^2}$ to exceed 1. Show that this is impossible using IEEE arithmetic, 
barring overflow or underflow. Hint: You will need to use more than the simple 
model $\mathrm{fl}(a \odot b) = (a \odot b)(1 + \delta)$ with $|\delta|$ small. Think about evaluating $\sqrt{x^2}$, 
and show that, barring overflow or underflow, $\mathrm{fl}(\sqrt{x^2}) = x$ exactly; in numerical 
experiments done by A. Liu, this failed about 5% of the time on a Cray YMP. 
You might try some numerical experiments and explain them. Extra credit: 
Prove the same result using correctly rounded decimal arithmetic. (The proof 
is different.) This question is due to W. Kahan, who was inspired by a bug in 
a Cray program of J. Sethian. 


QUESTION 1.18. (Hard) Suppose a and b are normalized IEEE double 
precision floating point numbers, and consider the following algorithm, running 
with IEEE arithmetic: 


    if (|a| < |b|), swap a and b
    s_1 = a + b
    s_2 = (a − s_1) + b


Prove the following facts: 


1. Barring overflow or underflow, the only roundoff error committed in 
running the algorithm is computing $s_1 = \mathrm{fl}(a + b)$. In other words, both 
subtractions $s_1 - a$ and $(s_1 - a) - b$ are computed exactly. 


2. $s_1 + s_2 = a + b$, exactly. This means that $s_2$ is actually the roundoff error 
committed when rounding the exact value of a + b to get $s_1$. 


Thus, this program in effect simulates quadruple precision arithmetic, 
representing the true sum a + b as the higher-order bits ($s_1$) and the lower-order 
bits ($s_2$). 

Using this and similar tricks in a systematic way, it is possible to effi- 
ciently simulate all four basic floating point operations in arbitrary precision 
arithmetic, using only the underlying floating point instructions and no “bit- 
fiddling” [202]. 128-bit arithmetic is implemented this way on the IBM RS6000 
and Cray (but much less efficiently on the Cray, which does not have IEEE 
arithmetic). 
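
The algorithm of Question 1.18 is easy to try in Python, whose floats are IEEE doubles on essentially all current platforms (an assumption of this sketch). The function name `fast_two_sum` is hypothetical, and the standard `fractions` module is used only to check the claimed identity exactly; this is a numerical check, not the requested proof.

```python
from fractions import Fraction


def fast_two_sum(a, b):
    """Return (s1, s2) with s1 = fl(a + b) and s2 the rounding error of s1."""
    if abs(a) < abs(b):
        a, b = b, a
    s1 = a + b
    s2 = (a - s1) + b
    return s1, s2


a, b = 0.1, 0.2
s1, s2 = fast_two_sum(a, b)
exact = Fraction(a) + Fraction(b)            # exact sum of the stored values
print(s1, s2, Fraction(s1) + Fraction(s2) == exact)
```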


QUESTION 1.19. (Hard; Programming) This question illustrates the challenges 
in engineering highly reliable numerical software. Your job is to write a 
program to compute the two-norm $s \equiv \|x\|_2 = (\sum_{i=1}^{n} x_i^2)^{1/2}$ given $x_1, \ldots, x_n$. 
The most obvious (and inadequate) algorithm is 


    s = 0
    for i = 1 to n
        s = s + x_i^2
    endfor
    s = sqrt(s)

This algorithm is inadequate because it does not have the following desirable 
properties: 


1. It must compute the answer accurately (i.e., nearly all the computed 
digits must be correct) unless $\|x\|_2$ is (nearly) outside the range of 
normalized floating point numbers. 


2. It must be nearly as fast as the obvious program above in most cases. 


3. It must work on any “reasonable” machine, possibly including ones not 
running IEEE arithmetic. This means it may not cause an error 
condition, unless $\|x\|_2$ is (nearly) larger than the largest floating point number. 


To illustrate the difficulties, note that the obvious algorithm fails when n = 1 
and $x_1$ is larger than the square root of the largest floating point number (in 
which case $x_1^2$ overflows, and the program returns $+\infty$ in IEEE arithmetic and 
halts in most non-IEEE arithmetics) or when n = 1 and $x_1$ is smaller than the 
square root of the smallest normalized floating point number (in which case 
$x_1^2$ underflows, possibly to zero, and the algorithm may return zero). Scaling 
the $x_i$ by dividing them all by $\max_i |x_i|$ does not have property 2), because 
division is usually many times more expensive than either multiplication or 
addition. Multiplying by $c = 1/\max_i |x_i|$ risks overflow in computing c, even 
when $\max_i |x_i| > 0$. 
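
The two failure cases can be reproduced in Python under the assumption of IEEE double arithmetic. The function names are hypothetical, and `scaled_norm2` is only the division-based scaling mentioned above, which is robust but does not meet property 2; it is not offered as a solution to the question.

```python
import math


def naive_norm2(x):
    """The obvious algorithm: sum the squares, then take the square root."""
    s = 0.0
    for xi in x:
        s = s + xi * xi
    return math.sqrt(s)


def scaled_norm2(x):
    """Scale by max |x_i| first; costs one division per entry."""
    big = max(abs(xi) for xi in x)
    if big == 0.0:
        return 0.0
    s = 0.0
    for xi in x:
        t = xi / big
        s = s + t * t
    return big * math.sqrt(s)


print(naive_norm2([1e200]), scaled_norm2([1e200]))     # the naive square overflows
print(naive_norm2([1e-200]), scaled_norm2([1e-200]))   # the naive square underflows
```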

This routine is important enough that it has been standardized as a Basic 
Linear Algebra Subroutine, or BLAS, which should be available on all machines 
[167]. We discuss the BLAS at length in section 2.6.1, and documentation 
and sample implementations may be found at NETLIB/blas. In particular, 
see NETLIB/cgi-bin/netlibget.pl/blas/snrm2.f for a sample implementation 
that has properties 1) and 3) but not 2). These sample implementations are 
intended to be starting points for implementations specialized to particular 
architectures (an easier problem than producing a completely portable one, as 
requested in this problem). Thus, when writing your own numerical software, 
you should think of computing $\|x\|_2$ as a building block that should be available 
in a numerical library on each machine. 

For another careful implementation of $\|x\|_2$, see [34]. 

You can extract test code from NETLIB/blas/sblat1 to see if your imple- 
mentation is correct; all implementations turned in must be thoroughly tested 
as well as timed, with times compared to the obvious algorithm above on those 
cases where both run. See how close to satisfying the three conditions you can 
come; the frequent use of the word “nearly” in conditions (1), (2) and (3) 
shows where you may compromise in attaining one condition in order to more 
nearly attain another. In particular, you might want to see how much easier 
the problem is if you limit yourself to machines running IEEE arithmetic. 

Hint: Assume that the values of the overflow and underflow thresholds are
available for your algorithm. Portable software for computing these values is
available (see NETLIB/cgi-bin/netlibget.pl/lapack/util/slamch.f).


QUESTION 1.20. (Easy; Medium) We will use a Matlab program to illustrate
how sensitive the roots of a polynomial can be to small perturbations in the
coefficients. The program is available⁵ at HOMEPAGE/Matlab/polyplot.m.


⁵Recall that we abbreviate the URL prefix of the class homepage to HOMEPAGE in the
text.
Polyplot takes an input polynomial specified by its roots r and then adds
random perturbations to the polynomial coefficients, computes the perturbed
roots, and plots them. The inputs are

r = vector of roots of the polynomial,
e = maximum relative perturbation to make to each coefficient of
the polynomial,
m = number of random polynomials to generate, whose roots are
plotted.


1. (Easy) The first part of your assignment is to run this program for the 
following inputs. In all cases choose m high enough that you get a fairly 
dense plot but don’t have to wait too long. m = a few hundred or perhaps 
1000 is enough. You may want to change the axes of the plot if the graph 
is too small or too large. 

• r=(1:10); e = 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8,
• r=(1:20); e = 1e-9, 1e-11, 1e-13, 1e-15,
• r=[2,4,8,16,...,1024]; e = 1e-1, 1e-2, 1e-3, 1e-4 (in this case, use
axis([.1,1e4,-4,4]) and semilogx(real(r1),imag(r1),'.') )

Also try your own example with complex conjugate roots. Which roots
are most sensitive?


2. (Medium) The second part of your assignment is to modify the program 
to compute the condition number c(i) for each root. In other words, a 
relative perturbation of e in each coefficient should change root r(i) by 
at most about e*c(i). Modify the program to plot circles centered at r(i) 
with radii e*c(i), and confirm that these circles enclose the perturbed 
roots (at least when e is small enough that the linearization used to 
derive the condition number is accurate). You should turn in a few plots 
with circles and perturbed eigenvalues, and some explanation of what 
you observe. 


3. (Medium) In the last part, notice that your formula for c(i) “blows up” if
$p'(r(i)) = 0$. This condition means that r(i) is a multiple root of $p(x) = 0$.
We can still expect some accuracy in the computed value of a multiple
root, however, and in this part of the question, we will ask how sensitive
a multiple root can be: First, write $p(x) = q(x) \cdot (x - r(i))^m$, where
$q(r(i)) \neq 0$ and $m$ is the multiplicity of the root r(i). Then compute the
$m$ roots nearest r(i) of the slightly perturbed polynomial $p(x) - q(x)\epsilon$,
and show that they differ from r(i) by $|\epsilon|^{1/m}$. Thus, if $m = 2$, for
instance, the root r(i) is perturbed by $\epsilon^{1/2}$, which is much larger than
$\epsilon$ if $|\epsilon| \ll 1$. Higher values of $m$ yield even larger perturbations. If $\epsilon$ is
around machine epsilon and represents rounding errors in computing the
root, this means an $m$-tuple root can lose all but $1/m$-th of its significant
digits.
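To get a feel for the sizes involved, here is a small numerical sketch of the quantity $|\epsilon|^{1/m}$. It only evaluates the formula stated in the question; the helper name `multiple_root_shift` and the value chosen for $\epsilon$ are assumptions made for illustration, and the derivation itself is still left to the reader.

```python
def multiple_root_shift(eps, m):
    """Size |eps|**(1/m) of the perturbation of an m-fold root."""
    return abs(eps) ** (1.0 / m)

eps = 1e-16  # assumed value, roughly double precision machine epsilon
for m in (1, 2, 4, 8):
    # Larger multiplicity m gives a larger shift, so fewer correct digits.
    print(m, multiple_root_shift(eps, m))
```

With 16 significant digits available, a double root keeps about 8 of them and a fourfold root about 4, in line with the "all but 1/m-th" statement.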
QUESTION 1.21. (Medium) Apply Algorithm 1.1, Bisection, to find the roots
of $p(x) = (x - 2)^9 = 0$, where $p(x)$ is evaluated using Horner’s rule. Use the
Matlab implementation in HOMEPAGE/Matlab/bisect.m, or else write your
own. Confirm that changing the input interval slightly changes the computed
root drastically. Modify the algorithm to use the error bound discussed in the
text to stop bisecting when the roundoff error in the computed value of $p(x)$
gets so large that its sign cannot be determined.
2.1. Introduction


This chapter discusses perturbation theory, algorithms, and error analysis for
solving the linear equation $Ax = b$. The algorithms are all variations on
Gaussian elimination. They are called direct methods, because in the absence
of roundoff error they would give the exact solution of $Ax = b$ after a finite
number of steps. In contrast, Chapter 6 discusses iterative methods, which
compute a sequence $x_0, x_1, x_2, \ldots$ of ever better approximate solutions of
$Ax = b$; one stops iterating (computing the next $x_{i+1}$) when $x_i$ is accurate enough.
Depending on the matrix $A$ and the speed with which $x_i$ converges to $x = A^{-1}b$,
a direct method or an iterative method may be faster or more accurate. We
will discuss the relative merits of direct and iterative methods at length in
Chapter 6. For now, we will just say that direct methods are the methods of
choice when the user has no special knowledge about the source⁶ of matrix $A$
or when a solution is required with guaranteed stability and in a guaranteed
amount of time.

The rest of this chapter is organized as follows. Section 2.2 discusses per- 
turbation theory for Ax = b; it forms the basis for the practical error bounds 
in section 2.4. Section 2.3 derives the Gaussian elimination algorithm for dense 
matrices. Section 2.4 analyzes the errors in Gaussian elimination and presents 
practical error bounds. Section 2.5 shows how to improve the accuracy of a 
solution computed by Gaussian elimination, using a simple and inexpensive 
iterative method. To get high speed from Gaussian elimination and other 
linear algebra algorithms on contemporary computers, care must be taken to 
organize the computation to respect the computer memory organization; this 
is discussed in section 2.6. Finally, section 2.7 discusses faster variations of
Gaussian elimination for matrices with special properties commonly arising in
practice, such as symmetry ($A = A^T$) or sparsity (when many entries of $A$ are
zero).


⁶For example, in Chapter 6 we consider the case when $A$ arises from approximating the
solution to a particular differential equation, Poisson’s equation.
Sections 2.2.1 and 2.5.1 discuss recent innovations upon which the software
in the LAPACK library depends.

There are a variety of open problems, which we shall mention as we go 
along. 


2.2. Perturbation Theory 


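Suppose $Ax = b$ and $(A + \delta A)\hat{x} = b + \delta b$; our goal is to bound the norm of
$\delta x = \hat{x} - x$. We simply subtract these two equalities and solve for $\delta x$: one way
to do this is to take

$$\begin{array}{rcl} (A + \delta A)(x + \delta x) & = & b + \delta b \\ -\;[\,Ax & = & b\,] \\ \hline \delta A\,x + (A + \delta A)\,\delta x & = & \delta b \end{array}$$

and rearrange to get

$$\delta x = A^{-1}(-\delta A\,\hat{x} + \delta b). \qquad (2.1)$$

Taking norms and using part 1 of Lemma 1.7 as well as the triangle inequality
for vector norms, we get

$$\|\delta x\| \le \|A^{-1}\| \, (\|\delta A\| \cdot \|\hat{x}\| + \|\delta b\|). \qquad (2.2)$$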


(We have assumed that the vector norm and matrix norm are consistent, as 
defined in section 1.7. For example, any vector norm and its induced matrix 
norm will do.) We can further rearrange this inequality to get 


$$\frac{\|\delta x\|}{\|\hat{x}\|} \le \|A^{-1}\| \cdot \|A\| \cdot \left( \frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|A\| \cdot \|\hat{x}\|} \right). \qquad (2.3)$$

The quantity $\kappa(A) = \|A^{-1}\| \cdot \|A\|$ is the condition number⁷ of the matrix
$A$, because it measures the relative change $\|\delta x\| / \|\hat{x}\|$ in the answer as a multiple
of the relative change $\|\delta A\| / \|A\|$ in the data. (To be rigorous, we need to show
that inequality (2.2) is an equality for some nonzero choice of $\delta A$ and $\delta b$;
otherwise $\kappa(A)$ would only be an upper bound on the condition number. See
Question 2.3.) The quantity multiplying $\kappa(A)$ will be small if $\delta A$ and $\delta b$ are
small, yielding a small upper bound on the relative error $\|\delta x\| / \|\hat{x}\|$.

The upper bound depends on $\delta x$ (via $\hat{x}$), which makes it seem hard to
interpret, but it is actually quite useful in practice, since we know the computed
solution $\hat{x}$ and so can straightforwardly evaluate the bound. We can also derive
a theoretically more attractive bound that does not depend on $\delta x$ as follows:


⁷More pedantically, it is the condition number with respect to the problem of matrix
inversion. The problem of finding the eigenvalues of $A$, for example, has a different condition
number.
LEMMA 2.1. Let $\|\cdot\|$ satisfy $\|AB\| \le \|A\| \cdot \|B\|$. Then $\|X\| < 1$ implies that
$I - X$ is invertible, $(I - X)^{-1} = \sum_{i=0}^{\infty} X^i$, and $\|(I - X)^{-1}\| \le \frac{1}{1 - \|X\|}$.


Proof. The sum $\sum_{i=0}^{\infty} X^i$ is said to converge if and only if it converges in
each component. We use the fact (from applying Lemma 1.4 to Example 1.6)
that for any norm, there is a constant $c$ such that $|x_{jk}| \le c \cdot \|X\|$. We then
get $|(X^i)_{jk}| \le c \cdot \|X^i\| \le c \cdot \|X\|^i$, so each component of $\sum X^i$ is dominated by
a convergent geometric series $\sum c\|X\|^i = \frac{c}{1 - \|X\|}$ and must converge. Therefore
$S_n = \sum_{i=0}^{n} X^i$ converges to some $S$ as $n \to \infty$, and
$(I - X)S_n = (I - X)(I + X + X^2 + \cdots + X^n) = I - X^{n+1} \to I$ as $n \to \infty$,
since $\|X^i\| \le \|X\|^i \to 0$.
Therefore $(I - X)S = I$ and $S = (I - X)^{-1}$. The final bound is
$\|(I - X)^{-1}\| = \|\sum_{i=0}^{\infty} X^i\| \le \sum_{i=0}^{\infty} \|X^i\| \le \sum_{i=0}^{\infty} \|X\|^i = \frac{1}{1 - \|X\|}$. □

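Solving our first equation $\delta A\,x + (A + \delta A)\,\delta x = \delta b$ for $\delta x$ yields

$$\begin{aligned} \delta x &= (A + \delta A)^{-1}(-\delta A\,x + \delta b) \\ &= [A(I + A^{-1}\delta A)]^{-1}(-\delta A\,x + \delta b) \\ &= (I + A^{-1}\delta A)^{-1}A^{-1}(-\delta A\,x + \delta b). \end{aligned}$$

Taking norms, dividing both sides by $\|x\|$, using part 1 of Lemma 1.7 and the
triangle inequality, and assuming that $\delta A$ is small enough so that
$\|A^{-1}\delta A\| \le \|A^{-1}\| \cdot \|\delta A\| < 1$, we get the desired bound:

$$\begin{aligned} \frac{\|\delta x\|}{\|x\|} &\le \|(I + A^{-1}\delta A)^{-1}\| \cdot \|A^{-1}\| \left( \|\delta A\| + \frac{\|\delta b\|}{\|x\|} \right) \\ &\le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \cdot \|\delta A\|} \left( \|\delta A\| + \frac{\|\delta b\|}{\|x\|} \right) \quad \text{by Lemma 2.1} \\ &= \frac{\|A^{-1}\| \cdot \|A\|}{1 - \|A^{-1}\| \cdot \|A\| \frac{\|\delta A\|}{\|A\|}} \left( \frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|A\| \cdot \|x\|} \right) \\ &\le \frac{\kappa(A)}{1 - \kappa(A) \frac{\|\delta A\|}{\|A\|}} \left( \frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|} \right) \qquad (2.4) \end{aligned}$$

since $\|b\| = \|Ax\| \le \|A\| \cdot \|x\|$.

This bound expresses the relative error $\|\delta x\| / \|x\|$ in the solution as a multiple
of the relative errors $\|\delta A\| / \|A\|$ and $\|\delta b\| / \|b\|$ in the input. The multiplier,
$\kappa(A) / (1 - \kappa(A) \|\delta A\| / \|A\|)$, is close to the condition number $\kappa(A)$ if $\|\delta A\|$ is small enough.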
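The bound can be checked numerically on a small system. The sketch below uses the infinity norm and a 2-by-2 matrix so that everything fits in plain Python; the matrix, right-hand side, and perturbations are made-up illustration data, and the helper names are not from any library.

```python
def inf_norm_vec(v):
    return max(abs(t) for t in v)

def inf_norm_mat(M):
    # Induced infinity norm: the largest absolute row sum.
    return max(sum(abs(t) for t in row) for row in M)

def inv2(M):
    # Explicit inverse of a 2-by-2 matrix (fine for a tiny illustration).
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def solve2(M, rhs):
    Mi = inv2(M)
    return [Mi[0][0] * rhs[0] + Mi[0][1] * rhs[1],
            Mi[1][0] * rhs[0] + Mi[1][1] * rhs[1]]

A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 4.0]                       # exact solution is x = [1, 1]
dA = [[1e-6, -2e-6], [0.0, 1e-6]]    # assumed perturbation of A
db = [1e-6, -1e-6]                   # assumed perturbation of b

x = solve2(A, b)
A_pert = [[A[i][j] + dA[i][j] for j in range(2)] for i in range(2)]
b_pert = [b[i] + db[i] for i in range(2)]
x_hat = solve2(A_pert, b_pert)

kappa = inf_norm_mat(inv2(A)) * inf_norm_mat(A)
rel_dA = inf_norm_mat(dA) / inf_norm_mat(A)
rel_db = inf_norm_vec(db) / inf_norm_vec(b)
rel_err = inf_norm_vec([x_hat[i] - x[i] for i in range(2)]) / inf_norm_vec(x)
bound = kappa / (1.0 - kappa * rel_dA) * (rel_dA + rel_db)
print("relative error:", rel_err, " bound:", bound)
```

For this data the observed relative error stays below the bound, as the derivation requires; how close the two are depends on the direction of the perturbations.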

The next theorem explains more about the assumption that
$\|A^{-1}\| \cdot \|\delta A\| = \kappa(A) \cdot \|\delta A\| / \|A\| < 1$: it guarantees that $A + \delta A$ is nonsingular,
which we need for $\delta x$ to exist. It also establishes a geometric characterization
of the condition number.


THEOREM 2.1. Let $A$ be nonsingular. Then

$$\min \left\{ \frac{\|\delta A\|_2}{\|A\|_2} : A + \delta A \text{ singular} \right\} = \frac{1}{\|A^{-1}\|_2 \cdot \|A\|_2} = \frac{1}{\kappa_2(A)}.$$

Therefore, the distance to the nearest singular matrix (ill-posed problem) =
1 / (condition number).


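Proof. It is enough to show $\min \{ \|\delta A\|_2 : A + \delta A \text{ singular} \} = \frac{1}{\|A^{-1}\|_2}$.

To show this minimum is at least $\frac{1}{\|A^{-1}\|_2}$, note that if $\|\delta A\|_2 < \frac{1}{\|A^{-1}\|_2}$,
then $1 > \|\delta A\|_2 \cdot \|A^{-1}\|_2 \ge \|A^{-1}\delta A\|_2$, so Lemma 2.1 implies that $I + A^{-1}\delta A$
is invertible, and so $A + \delta A$ is invertible.

To show the minimum equals $\frac{1}{\|A^{-1}\|_2}$, we construct a $\delta A$ of norm $\frac{1}{\|A^{-1}\|_2}$
such that $A + \delta A$ is singular. Note that since $\|A^{-1}\|_2 = \max_{x \neq 0} \frac{\|A^{-1}x\|_2}{\|x\|_2}$,
there exists an $x$ such that $\|x\|_2 = 1$ and $\|A^{-1}\|_2 = \|A^{-1}x\|_2 > 0$. Now let
$y = \frac{A^{-1}x}{\|A^{-1}x\|_2} = \frac{A^{-1}x}{\|A^{-1}\|_2}$, so $\|y\|_2 = 1$. Let $\delta A = \frac{-x y^T}{\|A^{-1}\|_2}$.
Then

$$\|\delta A\|_2 = \max_{z \neq 0} \frac{\|x y^T z\|_2}{\|A^{-1}\|_2 \, \|z\|_2} = \max_{z \neq 0} \frac{|y^T z|}{\|z\|_2} \cdot \frac{\|x\|_2}{\|A^{-1}\|_2} = \frac{1}{\|A^{-1}\|_2},$$

where the maximum is attained when $z$ is any nonzero multiple of $y$, and $A + \delta A$
is singular because

$$(A + \delta A)y = Ay - \frac{x y^T y}{\|A^{-1}\|_2} = \frac{x}{\|A^{-1}\|_2} - \frac{x}{\|A^{-1}\|_2} = 0. \qquad \Box$$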


We have now seen that the distance to the nearest ill-posed problem equals 
the reciprocal of the condition number for two problems: polynomial evaluation 
and linear equation solving. This reciprocal relationship is quite common in 
numerical analysis [70]. 

Here is a slightly different way to do perturbation theory for $Ax = b$; we
will need it to derive practical error bounds later in section 2.4.4. If $\hat{x}$ is any
vector, we can bound the difference $\delta x = \hat{x} - x = \hat{x} - A^{-1}b$ as follows. We let
$r = A\hat{x} - b$ be the residual of $\hat{x}$; the residual $r$ is zero if $\hat{x} = x$. This lets us
write $\delta x = A^{-1}r$, yielding the bound

$$\|\delta x\| = \|A^{-1}r\| \le \|A^{-1}\| \cdot \|r\|. \qquad (2.5)$$


This simple bound is attractive to use in practice, since $r$ is easy to compute,
given an approximate solution $\hat{x}$. Furthermore, there is no apparent need to
estimate $\delta A$ and $\delta b$. In fact our two approaches are very closely related, as
shown by the next theorem.


THEOREM 2.2. Let $r = A\hat{x} - b$. Then there exists a $\delta A$ such that $\|\delta A\| = \frac{\|r\|}{\|\hat{x}\|}$
and $(A + \delta A)\hat{x} = b$. No $\delta A$ of smaller norm and satisfying $(A + \delta A)\hat{x} = b$
exists. Thus, $\delta A$ is the smallest possible backward error (measured in norm).
This is true for any vector norm and its induced norm (or $\|\cdot\|_2$ for vectors
and $\|\cdot\|_F$ for matrices).
Proof. $(A + \delta A)\hat{x} = b$ if and only if $\delta A \cdot \hat{x} = b - A\hat{x} = -r$, so
$\|r\| = \|\delta A \cdot \hat{x}\| \le \|\delta A\| \cdot \|\hat{x}\|$, implying $\|\delta A\| \ge \frac{\|r\|}{\|\hat{x}\|}$.
We complete the proof only for the two-norm and its induced matrix norm.
Choose $\delta A = \frac{-r \cdot \hat{x}^T}{\|\hat{x}\|_2^2}$. We can easily verify that
$\delta A \cdot \hat{x} = -r$ and $\|\delta A\|_2 = \frac{\|r\|_2}{\|\hat{x}\|_2}$. □

Thus, the smallest $\|\delta A\|$ that could yield an $\hat{x}$ satisfying $(A + \delta A)\hat{x} = b$ and
$r = A\hat{x} - b$ is given by Theorem 2.2. Applying error bound (2.2) (with $\delta b = 0$)
yields

$$\|\delta x\| \le \|A^{-1}\| \left( \frac{\|r\|}{\|\hat{x}\|} \cdot \|\hat{x}\| \right) = \|A^{-1}\| \cdot \|r\|,$$

the same bound as (2.5).

All our bounds depend on the ability to estimate the condition number
$\|A\| \cdot \|A^{-1}\|$. We return to this problem in section 2.4.3. Condition number
estimates are computed by LAPACK routines such as sgesvx.
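The rank-one backward error of Theorem 2.2 is simple to construct explicitly. The following sketch, with a made-up 2-by-2 system and a made-up approximate solution, forms $\delta A = -r\hat{x}^T / \|\hat{x}\|_2^2$ and checks that $\hat{x}$ solves the perturbed system. It relies on the fact that for a rank-one matrix the two-norm and the Frobenius norm coincide, so the Frobenius norm is used to measure $\delta A$.

```python
import math

A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 4.0]
x_hat = [1.001, 0.997]   # arbitrary approximate solution (illustration only)
n = 2

# Residual r = A * x_hat - b.
r = [sum(A[i][j] * x_hat[j] for j in range(n)) - b[i] for i in range(n)]

# Backward error dA = -r * x_hat^T / ||x_hat||_2^2.
nx2 = sum(t * t for t in x_hat)
dA = [[-r[i] * x_hat[j] / nx2 for j in range(n)] for i in range(n)]

# (A + dA) * x_hat should reproduce b up to roundoff.
lhs = [sum((A[i][j] + dA[i][j]) * x_hat[j] for j in range(n)) for i in range(n)]

# Predicted norm ||r||_2 / ||x_hat||_2 versus the norm of the constructed dA.
predicted = math.sqrt(sum(t * t for t in r)) / math.sqrt(nx2)
actual = math.sqrt(sum(dA[i][j] ** 2 for i in range(n) for j in range(n)))
print(lhs, predicted, actual)
```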


2.2.1. Relative Perturbation Theory 


In the last section we showed how to bound the norm of the error $\delta x = \hat{x} - x$
in the approximate solution $\hat{x}$ of $Ax = b$. Our bound on $\|\delta x\|$ was proportional
to the condition number $\kappa(A) = \|A\| \cdot \|A^{-1}\|$ times the norms $\|\delta A\|$ and $\|\delta b\|$,
where $\hat{x}$ satisfies $(A + \delta A)\hat{x} = b + \delta b$.

In many cases this bound is quite satisfactory, but not always. Our goal in 
this section is to show when it is too pessimistic and to derive an alternative 
perturbation theory that provides tighter bounds. We will use this perturba- 
tion theory later in section 2.5.1 to justify the error bounds computed by the 
LAPACK subroutines like sgesvx. 

This section may be skipped on a first reading. 

Here is an example where the error bound of the last section is much too 
pessimistic. 


EXAMPLE 2.1. Let $A = \mathrm{diag}(\gamma, 1)$ (a diagonal matrix with entries $a_{11} = \gamma$
and $a_{22} = 1$) and $b = [\gamma, 1]^T$, where $\gamma > 1$. Then $x = A^{-1}b = [1, 1]^T$. Any
reasonable direct method will solve $Ax = b$ very accurately (using two divisions
$b_i / a_{ii}$) to get $\hat{x}$, yet the condition number $\kappa(A) = \gamma$ may be arbitrarily large.
Therefore our error bound (2.3) may be arbitrarily large.

The reason that the condition number $\kappa(A)$ leads us to overestimate the
error is that bound (2.2), from which it comes, assumes that $\delta A$ is bounded
in norm, but is otherwise arbitrary; this is needed to prove that bound (2.2)
is attainable in Question 2.3. In contrast, the $\delta A$ corresponding to the actual
rounding errors is not arbitrary but has a special structure not captured by
its norm alone. We can determine the smallest $\delta A$ corresponding to $\hat{x}$ for
our problem as follows: A simple rounding error analysis shows that
$\hat{x}_i = (b_i / a_{ii}) / (1 + \delta_i)$, where $|\delta_i| \le \epsilon$. Thus $(a_{ii} + \delta_i a_{ii})\hat{x}_i = b_i$. We may rewrite this
as $(A + \delta A)\hat{x} = b$, where $\delta A = \mathrm{diag}(\delta_1 a_{11}, \delta_2 a_{22})$. Then $\|\delta A\|$ can be as large
as $\max_i |\epsilon a_{ii}| = \epsilon\gamma$. Applying error bound (2.3) with $\delta b = 0$ yields

$$\frac{\|\delta x\|_\infty}{\|\hat{x}\|_\infty} \le \kappa(A) \left( \frac{\epsilon\gamma}{\gamma} \right) = \epsilon\gamma.$$

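In contrast, the actual error satisfies

$$\|\delta x\|_\infty = \|\hat{x} - x\|_\infty = \left\| \begin{bmatrix} (b_1/a_{11})/(1 + \delta_1) - (b_1/a_{11}) \\ (b_2/a_{22})/(1 + \delta_2) - (b_2/a_{22}) \end{bmatrix} \right\|_\infty = \left\| \begin{bmatrix} -\delta_1/(1 + \delta_1) \\ -\delta_2/(1 + \delta_2) \end{bmatrix} \right\|_\infty \le \frac{\epsilon}{1 - \epsilon}$$

or

$$\frac{\|\delta x\|_\infty}{\|\hat{x}\|_\infty} \le \frac{\epsilon}{(1 - \epsilon)^2},$$

which is about $\gamma$ times smaller. ◇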


For this example, we can describe the structure of the actual $\delta A$ as follows:
$|\delta a_{ij}| \le \epsilon |a_{ij}|$, where $\epsilon$ is a tiny number. We write this more succinctly as

$$|\delta A| \le \epsilon |A| \qquad (2.6)$$

(see section 1.1 for notation). We also say that $\delta A$ is a small componentwise
relative perturbation in $A$. Since $\delta A$ can often be made to satisfy bound (2.6) in
practice, along with $|\delta b| \le \epsilon |b|$ (see section 2.5.1), we will derive perturbation
theory using these bounds on $\delta A$ and $\delta b$.

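We begin with equation (2.1):

$$\delta x = A^{-1}(-\delta A\,\hat{x} + \delta b).$$

Now take absolute values, and repeatedly use the triangle inequality to get

$$\begin{aligned} |\delta x| &= |A^{-1}(-\delta A\,\hat{x} + \delta b)| \\ &\le |A^{-1}| \, (|\delta A| \cdot |\hat{x}| + |\delta b|) \\ &\le |A^{-1}| \, (\epsilon |A| \cdot |\hat{x}| + \epsilon |b|) \\ &= \epsilon \, (|A^{-1}| \, (|A| \cdot |\hat{x}| + |b|)). \end{aligned}$$

Now using any vector norm (like the infinity-, one-, or Frobenius norms), where
$\| \, |z| \, \| = \|z\|$, we get the bound

$$\|\delta x\| \le \epsilon \, \| \, |A^{-1}| \, (|A| \cdot |\hat{x}| + |b|) \, \|. \qquad (2.7)$$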
Assuming for the moment that $\delta b = 0$, we can weaken this bound to

$$\|\delta x\| \le \epsilon \, \| \, |A^{-1}| \cdot |A| \, \| \cdot \|\hat{x}\|$$

or

$$\frac{\|\delta x\|}{\|\hat{x}\|} \le \epsilon \, \| \, |A^{-1}| \cdot |A| \, \|. \qquad (2.8)$$

This leads us to define $\kappa_{CR}(A) = \| \, |A^{-1}| \cdot |A| \, \|$ as the componentwise relative
condition number of $A$, or just relative condition number for short. It is
sometimes also called the Bauer condition number [26] or Skeel condition number
[223, 224, 225]. For a proof that bounds (2.7) and (2.8) are attainable, see
Question 2.4.

Recall that Theorem 2.1 related the condition number $\kappa(A)$ to the distance
from $A$ to the nearest singular matrix. For a similar interpretation of $\kappa_{CR}(A)$,
see [71, 206].


EXAMPLE 2.2. Consider our earlier example with $A = \mathrm{diag}(\gamma, 1)$ and
$b = [\gamma, 1]^T$. It is easy to confirm that $\kappa_{CR}(A) = 1$, since $|A^{-1}| \cdot |A| = I$. Indeed,
$\kappa_{CR}(A) = 1$ for any diagonal matrix $A$, capturing our intuition that a diagonal
system of equations should be solvable quite accurately. ◇


More generally, suppose D is any nonsingular diagonal matrix and B is an 
arbitrary nonsingular matrix. Then 


$$\begin{aligned} \kappa_{CR}(DB) &= \| \, |(DB)^{-1}| \cdot |DB| \, \| \\ &= \| \, |B^{-1}D^{-1}| \cdot |DB| \, \| \\ &= \| \, |B^{-1}| \cdot |B| \, \| \\ &= \kappa_{CR}(B). \end{aligned}$$

This means that if $DB$ is badly scaled, i.e., $B$ is well-conditioned but $DB$
is badly conditioned (because $D$ has widely varying diagonal entries), then
we should hope to get an accurate solution of $(DB)x = b$ despite $DB$’s
ill-conditioning. This is discussed further in sections 2.4.4, 2.5.1, and 2.5.2.
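The invariance of $\kappa_{CR}$ under row scaling can be observed directly. The sketch below compares the ordinary and the componentwise relative condition numbers in the infinity norm for a made-up well-conditioned $B$ and a badly scaled $DB$; the helper names and the value of $\gamma$ are assumptions for illustration, and the explicit 2-by-2 inverse is used only because the example is tiny.

```python
def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def absm(M):
    # Componentwise absolute value |M|.
    return [[abs(t) for t in row] for row in M]

def inf_norm_mat(M):
    return max(sum(abs(t) for t in row) for row in M)

def kappa_inf(M):
    """Ordinary condition number ||M^{-1}|| * ||M|| in the infinity norm."""
    return inf_norm_mat(inv2(M)) * inf_norm_mat(M)

def kappa_cr(M):
    """Componentwise relative condition number || |M^{-1}| * |M| ||."""
    return inf_norm_mat(matmul2(absm(inv2(M)), absm(M)))

gamma = 1e8
B = [[2.0, 1.0], [1.0, 3.0]]
D = [[gamma, 0.0], [0.0, 1.0]]
DB = matmul2(D, B)

print("kappa(B)     =", kappa_inf(B))
print("kappa(DB)    =", kappa_inf(DB))
print("kappa_CR(B)  =", kappa_cr(B))
print("kappa_CR(DB) =", kappa_cr(DB))
```

The ordinary condition number of $DB$ grows with $\gamma$, while the two relative condition numbers agree up to roundoff.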

Finally, as in the last section we provide an error bound using only the
residual $r = A\hat{x} - b$:

$$\|\delta x\| = \|A^{-1}r\| \le \| \, |A^{-1}| \cdot |r| \, \|, \qquad (2.9)$$

where we have used the triangle inequality. In section 2.4.4 we will see that
this bound can sometimes be much smaller than the similar bound (2.5), in
particular when $A$ is badly scaled. There is also an analogue to Theorem 2.2
[191].


THEOREM 2.3. The smallest $\epsilon > 0$ such that there exist $|\delta A| \le \epsilon |A|$ and
$|\delta b| \le \epsilon |b|$ satisfying $(A + \delta A)\hat{x} = b + \delta b$ is called the componentwise relative backward
error. It may be expressed in terms of the residual $r = A\hat{x} - b$ as follows:

$$\epsilon = \max_i \frac{|r_i|}{(|A| \cdot |\hat{x}| + |b|)_i}.$$
For a proof, see Question 2.5.
LAPACK routines like sgesvx compute the componentwise relative backward
error $\epsilon$ (the LAPACK variable name for $\epsilon$ is BERR).
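The formula of Theorem 2.3 translates almost line for line into code. The following is a minimal sketch, not the LAPACK implementation: the function name is hypothetical, and skipping rows whose denominator is zero is an assumption made here to avoid a 0/0 (in such a row the residual entry is zero as well).

```python
def componentwise_backward_error(A, x_hat, b):
    """max_i |r_i| / (|A| |x_hat| + |b|)_i  with  r = A x_hat - b."""
    n = len(b)
    berr = 0.0
    for i in range(n):
        r_i = sum(A[i][j] * x_hat[j] for j in range(n)) - b[i]
        denom = sum(abs(A[i][j]) * abs(x_hat[j]) for j in range(n)) + abs(b[i])
        if denom > 0.0:      # assumption: ignore rows that are identically zero
            berr = max(berr, abs(r_i) / denom)
    return berr

# The badly scaled diagonal system of Example 2.1, with one slightly wrong entry.
gamma = 1e8
A = [[gamma, 0.0], [0.0, 1.0]]
b = [gamma, 1.0]
x_hat = [1.0 / (1.0 + 1e-10), 1.0]
print(componentwise_backward_error(A, x_hat, b))
```

For this input the result is on the order of the relative perturbation in the first component, and it does not grow with $\gamma$.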


2.3. Gaussian Elimination 


The basic algorithm for solving $Ax = b$ is Gaussian elimination. To state it,
we first need to define a permutation matrix.


DEFINITION 2.1. A permutation matrix P is an identity matrix with permuted 
rows. 


The most important properties of a permutation matrix are given by the 
following lemma. 


LEMMA 2.2. Let $P$, $P_1$, and $P_2$ be n-by-n permutation matrices and $X$ be an
n-by-n matrix. Then


1. $PX$ is the same as $X$ with its rows permuted. $XP$ is the same as $X$ with
its columns permuted.


2. $P^{-1} = P^T$.


3. $\det(P) = \pm 1$.


4. $P_1 \cdot P_2$ is also a permutation matrix.


For a proof, see Question 2.6. 
Now we can state our overall algorithm for solving Ax = b. 


ALGORITHM 2.1. Solving Ax = b using Gaussian elimination: 


1. Factorize $A$ into $A = PLU$, where

P = permutation matrix,
L = unit lower triangular matrix (i.e., with ones on the diagonal),
U = nonsingular upper triangular matrix.


2. Solve $PLUx = b$ for $LUx$ by permuting the entries of $b$: $LUx = P^{-1}b = P^Tb$.


3. Solve $LUx = P^{-1}b$ for $Ux$ by forward substitution: $Ux = L^{-1}(P^{-1}b)$.


4. Solve $Ux = L^{-1}(P^{-1}b)$ for $x$ by back substitution: $x = U^{-1}(L^{-1}P^{-1}b)$.
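Steps 2 through 4 can be sketched in a few lines once the factors are available. In the sketch below the permutation is stored as a hypothetical index list `perm` with $(P^Tb)_i = b[\mathrm{perm}[i]]$, and `L` and `U` are plain lists of rows; these storage choices and the function name are assumptions for illustration.

```python
def solve_plu(perm, L, U, b):
    """Solve A x = b given A = P L U (steps 2-4 of the algorithm)."""
    n = len(b)
    # Step 2: permute the entries of b to form P^T b.
    c = [b[perm[i]] for i in range(n)]
    # Step 3: forward substitution with the unit lower triangular L.
    y = [0.0] * n
    for i in range(n):
        y[i] = c[i] - sum(L[i][j] * y[j] for j in range(i))
    # Step 4: back substitution with the upper triangular U.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / U[i][i]
    return x

# A = [[1, 2], [4, 3]]; swapping its rows gives [[4, 3], [1, 2]] = L U.
perm = [1, 0]
L = [[1.0, 0.0], [0.25, 1.0]]
U = [[4.0, 3.0], [0.0, 1.25]]
b = [3.0, 7.0]               # equals A times [1, 1]
print(solve_plu(perm, L, U, b))
```

Both triangular solves cost $O(n^2)$ operations, so the factorization in step 1 dominates the total cost.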


We will derive the algorithm for factorizing A = PLU in several ways. We 
begin by showing why the permutation matrix P is necessary. 
DEFINITION 2.2. The leading j-by-j principal submatrix of $A$ is $A(1 : j, 1 : j)$.


THEOREM 2.4. The following two statements are equivalent: 


1. There exists a unique unit lower triangular L and nonsingular upper 
triangular U such that A= LU. 


2. All leading principal submatrices of A are nonsingular. 


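Proof. We first show (1) implies (2). $A = LU$ may also be written

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} = \begin{bmatrix} L_{11}U_{11} & L_{11}U_{12} \\ L_{21}U_{11} & L_{21}U_{12} + L_{22}U_{22} \end{bmatrix},$$

where $A_{11}$ is a j-by-j leading principal submatrix, as are $L_{11}$ and $U_{11}$. Therefore
$\det A_{11} = \det(L_{11}U_{11}) = \det L_{11} \det U_{11} = 1 \cdot \prod_{k=1}^{j} (U_{11})_{kk} \neq 0$, since $L$ is
unit triangular and $U$ is triangular.

We prove that (2) implies (1) by induction on $n$. It is easy for 1-by-1
matrices: $a = 1 \cdot a$. To prove it for n-by-n matrices $\tilde{A}$, we need to find unique
(n - 1)-by-(n - 1) triangular matrices $L$ and $U$, unique (n - 1)-by-1 vectors $l$
and $u$, and a unique nonzero scalar $\eta$ such that

$$\tilde{A} = \begin{bmatrix} A & b \\ c^T & \delta \end{bmatrix} = \begin{bmatrix} L & 0 \\ l^T & 1 \end{bmatrix} \begin{bmatrix} U & u \\ 0 & \eta \end{bmatrix} = \begin{bmatrix} LU & Lu \\ l^TU & l^Tu + \eta \end{bmatrix}.$$

By induction, unique $L$ and $U$ exist such that $A = LU$. Now let $u = L^{-1}b$,
$l^T = c^TU^{-1}$, and $\eta = \delta - l^Tu$, all of which are unique. The diagonal entries of
$U$ are nonzero by induction, and $\eta \neq 0$ since $0 \neq \det(\tilde{A}) = \det(U) \cdot \eta$. □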
Thus LU factorization without pivoting can fail on (well-conditioned)
nonsingular matrices such as the permutation matrix

$$P = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix};$$

the 1-by-1 and 2-by-2 leading principal minors of $P$ are singular. So we need
to introduce permutations into Gaussian elimination.


THEOREM 2.5. If $A$ is nonsingular, then there exist permutations $P_1$ and $P_2$,
a unit lower triangular matrix $L$, and a nonsingular upper triangular matrix
$U$ such that $P_1AP_2 = LU$. Only one of $P_1$ and $P_2$ is necessary.

Note: $P_1A$ reorders the rows of $A$, $AP_2$ reorders the columns, and $P_1AP_2$
reorders both.
Proof. As with many matrix factorizations, it suffices to understand block
2-by-2 matrices. More formally, we use induction on the dimension $n$. It is
easy for 1-by-1 matrices: $P_1 = P_2 = L = 1$ and $U = A$. Assume that it is
true for dimension $n - 1$. If $A$ is nonsingular, then it has a nonzero entry;
choose permutations $P_1'$ and $P_2'$ so that the (1,1) entry of $P_1'AP_2'$ is nonzero.
(We need only one of $P_1'$ and $P_2'$ since nonsingularity implies that each row and
each column of $A$ has a nonzero entry.)

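Now we write the desired factorization and solve for the unknown components:

$$P_1'AP_2' = \begin{bmatrix} a_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ L_{21} & I \end{bmatrix} \begin{bmatrix} u_{11} & U_{12} \\ 0 & \tilde{A}_{22} \end{bmatrix} = \begin{bmatrix} u_{11} & U_{12} \\ L_{21}u_{11} & L_{21}U_{12} + \tilde{A}_{22} \end{bmatrix}, \qquad (2.10)$$

where $A_{22}$ and $\tilde{A}_{22}$ are (n - 1)-by-(n - 1).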

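Solving for the components of this 2-by-2 block factorization we get
$u_{11} = a_{11} \neq 0$, $U_{12} = A_{12}$, and $L_{21}u_{11} = A_{21}$. Since $u_{11} = a_{11} \neq 0$, we can solve for
$L_{21} = A_{21} / a_{11}$. Finally, $L_{21}U_{12} + \tilde{A}_{22} = A_{22}$ implies $\tilde{A}_{22} = A_{22} - L_{21}U_{12}$.

We want to apply induction to $\tilde{A}_{22}$, but to do so we need to check that
$\det \tilde{A}_{22} \neq 0$: Since $\det P_1'AP_2' = \pm \det A \neq 0$ and also

$$\det P_1'AP_2' = \det \begin{bmatrix} 1 & 0 \\ L_{21} & I \end{bmatrix} \cdot \det \begin{bmatrix} u_{11} & U_{12} \\ 0 & \tilde{A}_{22} \end{bmatrix} = 1 \cdot (u_{11} \cdot \det \tilde{A}_{22}),$$

then $\det \tilde{A}_{22}$ must be nonzero.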

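Therefore, by induction there exist permutations $\tilde{P}_1$ and $\tilde{P}_2$ so that
$\tilde{P}_1 \tilde{A}_{22} \tilde{P}_2 = \tilde{L}\tilde{U}$, with $\tilde{L}$ unit lower triangular and $\tilde{U}$ upper triangular and nonsingular.
Substituting this in the above 2-by-2 block factorization yields

$$\begin{aligned} P_1'AP_2' &= \begin{bmatrix} 1 & 0 \\ L_{21} & I \end{bmatrix} \begin{bmatrix} u_{11} & U_{12} \\ 0 & \tilde{P}_1^T \tilde{L}\tilde{U} \tilde{P}_2^T \end{bmatrix} \\ &= \begin{bmatrix} 1 & 0 \\ L_{21} & \tilde{P}_1^T \tilde{L} \end{bmatrix} \begin{bmatrix} u_{11} & U_{12} \\ 0 & \tilde{U} \tilde{P}_2^T \end{bmatrix} \\ &= \begin{bmatrix} 1 & 0 \\ L_{21} & \tilde{P}_1^T \tilde{L} \end{bmatrix} \begin{bmatrix} u_{11} & U_{12}\tilde{P}_2 \\ 0 & \tilde{U} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \tilde{P}_2^T \end{bmatrix} \\ &= \begin{bmatrix} 1 & 0 \\ 0 & \tilde{P}_1^T \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \tilde{P}_1 L_{21} & \tilde{L} \end{bmatrix} \begin{bmatrix} u_{11} & U_{12}\tilde{P}_2 \\ 0 & \tilde{U} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \tilde{P}_2^T \end{bmatrix}, \end{aligned}$$

so we get the desired factorization of $A$:

$$P_1AP_2 = \left( \begin{bmatrix} 1 & 0 \\ 0 & \tilde{P}_1 \end{bmatrix} P_1' \right) A \left( P_2' \begin{bmatrix} 1 & 0 \\ 0 & \tilde{P}_2 \end{bmatrix} \right) = \begin{bmatrix} 1 & 0 \\ \tilde{P}_1 L_{21} & \tilde{L} \end{bmatrix} \begin{bmatrix} u_{11} & U_{12}\tilde{P}_2 \\ 0 & \tilde{U} \end{bmatrix}. \qquad \Box$$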
The next two corollaries state simple ways to choose $P_1$ and $P_2$ to guarantee
that Gaussian elimination will succeed on a nonsingular matrix.


COROLLARY 2.1. We can choose $P_2' = I$ and $P_1'$ so that $a_{11}$ is the largest entry
in absolute value in its column, which implies $L_{21} = A_{21} / a_{11}$ has entries bounded by
1 in absolute value. More generally, at step $i$ of Gaussian elimination, where
we are computing the $i$th column of $L$, we reorder the rows so that the largest
entry in the column is on the diagonal. This is called “Gaussian elimination
with partial pivoting,” or GEPP for short. GEPP guarantees that all entries
of $L$ are bounded by one in absolute value.


GEPP is the most common way to implement Gaussian elimination in
practice. We discuss its numerical stability in the next section. Another more
expensive way to choose $P_1$ and $P_2$ is given by the next corollary. It is almost
never used in practice, although there are rare examples where GEPP fails but
the next method succeeds in computing an accurate answer (see Question 2.14).
We discuss it briefly in the next section as well.


COROLLARY 2.2. We can choose $P_1'$ and $P_2'$ so that $a_{11}$ is the largest entry
in absolute value in the whole matrix. More generally, at step $i$ of Gaussian
elimination, where we are computing the $i$th column of $L$, we reorder the rows
and columns so that the largest entry in the matrix is on the diagonal. This is
called “Gaussian elimination with complete pivoting,” or GECP for short.


The following algorithm embodies Theorem 2.5, performing permutations,
computing the first column of $L$ and the first row of $U$, and updating $A_{22}$ to get
$\tilde{A}_{22} = A_{22} - L_{21}U_{12}$. We write the algorithm first in conventional programming
language notation and then using Matlab notation.


ALGORITHM 2.2. LU factorization with pivoting 


for i = 1 to n - 1
    apply permutations so a_ii ≠ 0 (permute L and U too)
    /* for example, for GEPP, swap rows j and i of A and of L
       where |a_ji| is the largest entry in |A(i : n, i)|;
       for GECP, swap rows j and i of A and of L,
       and columns k and i of A and of U,
       where |a_jk| is the largest entry in |A(i : n, i : n)| */
    /* compute column i of L (L21 in (2.10)) */
    for j = i + 1 to n
        l_ji = a_ji / a_ii
    end for
    /* compute row i of U (U12 in (2.10)) */
    for j = i to n
        u_ij = a_ij
    end for
    /* update A22 (to get Ã22 = A22 - L21 U12 in (2.10)) */
    for j = i + 1 to n
        for k = i + 1 to n
            a_jk = a_jk - l_ji * u_ik
        end for
    end for
end for

Note that once column $i$ of $A$ is used to compute column $i$ of $L$, it is never
used again. Similarly, row $i$ of $A$ is never used again after computing row $i$ of
$U$. This lets us overwrite $L$ and $U$ on top of $A$ as they are computed, so we
need no extra space to store them; L occupies the (strict) lower triangle of A 
(the ones on the diagonal of L are not stored explicitly), and U occupies the 
upper triangle of A. This simplifies the algorithm to 


ALGORITHM 2.3. LU factorization with pivoting, overwriting L and U on A: 


for i = 1 to n - 1
    apply permutations (see Algorithm 2.2 for details)
    for j = i + 1 to n
        a_ji = a_ji / a_ii
    end for
    for j = i + 1 to n
        for k = i + 1 to n
            a_jk = a_jk - a_ji * a_ik
        end for
    end for
end for


Using Matlab notation this further reduces to the following algorithm. 
ALGORITHM 2.4. LU factorization with pivoting, overwriting L and U on A: 


for i = 1 to n - 1
    apply permutations (see Algorithm 2.2 for details)
    A(i + 1 : n, i) = A(i + 1 : n, i) / A(i, i)
    A(i + 1 : n, i + 1 : n) =
        A(i + 1 : n, i + 1 : n) - A(i + 1 : n, i) * A(i, i + 1 : n)
end for


In the last line of the algorithm, A(i + 1 : n, i) * A(i, i + 1 : n) is the product
of an (n - i)-by-1 matrix ($L_{21}$) by a 1-by-(n - i) matrix ($U_{12}$), which yields an
(n - i)-by-(n - i) matrix.
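For readers who want something runnable, here is a minimal Python sketch of the in-place factorization with partial pivoting (GEPP). It uses 0-based indices and a list of rows; the function name, the returned row order `perm`, and the error raised on a zero pivot column are choices made for this illustration rather than part of the algorithms above.

```python
def lu_gepp_in_place(A):
    """Overwrite square A with L (strictly below the diagonal) and U
    (on and above the diagonal). Returns perm, where row i of the
    permuted matrix is row perm[i] of the original A."""
    n = len(A)
    perm = list(range(n))
    for i in range(n - 1):
        # Partial pivoting: pick the largest |a_ji| with j >= i.
        p = max(range(i, n), key=lambda j: abs(A[j][i]))
        if A[p][i] == 0.0:
            raise ValueError("matrix is singular")
        A[i], A[p] = A[p], A[i]          # swaps rows of A and of L together
        perm[i], perm[p] = perm[p], perm[i]
        # Column i of L overwrites the subdiagonal of column i of A.
        for j in range(i + 1, n):
            A[j][i] = A[j][i] / A[i][i]
        # Update the trailing submatrix.
        for j in range(i + 1, n):
            for k in range(i + 1, n):
                A[j][k] = A[j][k] - A[j][i] * A[i][k]
    return perm

A = [[1.0, 2.0], [4.0, 3.0]]
perm = lu_gepp_in_place(A)
print(perm, A)
```

Note that the loop never inspects the last diagonal entry, so a singular matrix whose rank deficiency only shows up in the final pivot is not detected by this sketch.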
We now rederive this algorithm from scratch starting from perhaps the most
familiar description of Gaussian elimination: “Take each row and subtract
multiples of it from later rows to zero out the entries below the diagonal.”
Translating this directly into an algorithm yields


for i = 1 to n - 1          /* for each row i */
    for j = i + 1 to n      /* subtract a multiple of row i from row j ... */
        for k = i to n      /* ... in columns i through n ... */
            a_jk = a_jk - (a_ji / a_ii) * a_ik
                            /* ... to zero out column i below the diagonal */
        end for
    end for
end for


We will now make some improvements to this algorithm, modifying it until 
it becomes identical to Algorithm 2.3 (except for pivoting, which we omit). 
First, we recognize that we need not compute the zero entries below the diag- 
onal, because we know they are zero. This shortens the k loop to yield 


for i = 1 to n - 1
    for j = i + 1 to n
        for k = i + 1 to n
            a_jk = a_jk - (a_ji / a_ii) * a_ik
        end for
    end for
end for


The next performance improvement is to compute a_ji / a_ii outside the inner
loop, since it is constant within the inner loop.


for i = 1 to n - 1
    for j = i + 1 to n
        l_ji = a_ji / a_ii
    end for
    for j = i + 1 to n
        for k = i + 1 to n
            a_jk = a_jk - l_ji * a_ik
        end for
    end for
end for


Finally, we store the multipliers l_ji in the subdiagonal entries a_ji that we
originally zeroed out; they are not needed for anything else. This yields
Algorithm 2.3 (except for pivoting).
The operation count of LU is done by replacing loops by summations over
the same range, and inner loops by their operation counts:

$$\sum_{i=1}^{n-1} \left( \sum_{j=i+1}^{n} 1 + \sum_{j=i+1}^{n} \sum_{k=i+1}^{n} 2 \right) = \sum_{i=1}^{n-1} \left( (n - i) + 2(n - i)^2 \right) = \frac{2}{3}n^3 + O(n^2).$$

The forward and back substitutions with $L$ and $U$ to complete the solution
of $Ax = b$ cost $O(n^2)$, so overall solving $Ax = b$ with Gaussian elimination
costs $\frac{2}{3}n^3 + O(n^2)$ operations. Here we have used the fact that
$\sum_{i=1}^{m} i^k = m^{k+1}/(k+1) + O(m^k)$. This formula is enough to get the high order term in
the operation count.
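The count can be confirmed by brute force. The sketch below walks the loop nest of Algorithm 2.3 (ignoring pivoting) and tallies one operation per division and two per update of an entry; the function name is made up for illustration.

```python
def lu_flop_count(n):
    """Count floating point operations in the loops of Algorithm 2.3."""
    flops = 0
    for i in range(1, n):                 # i = 1, ..., n-1
        for j in range(i + 1, n + 1):     # one division per multiplier
            flops += 1
        for j in range(i + 1, n + 1):     # one multiply and one subtract
            for k in range(i + 1, n + 1): # per updated entry
                flops += 2
    return flops

for n in (10, 100):
    exact = lu_flop_count(n)
    leading = 2.0 * n ** 3 / 3.0
    print(n, exact, leading, exact / leading)
```

The ratio of the exact count to $\frac{2}{3}n^3$ approaches 1 as $n$ grows, since the remaining terms are only $O(n^2)$.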

There is more to implementing Gaussian elimination than writing the 
nested loops of Algorithm 2.2. Indeed, depending on the computer, program- 
ming language, and matrix size, merely interchanging the last two loops on j 
and k can change the execution time by orders of magnitude. We discuss this 
at length in section 2.6. 


2.4. Error Analysis 


Recall our two-step paradigm for obtaining error bounds for the solution of 
Ax =b: 


1. Analyze roundoff errors to show that the result of solving $Ax = b$ is the
exact solution $\hat{x}$ of the perturbed linear system $(A + \delta A)\hat{x} = b + \delta b$, where
$\delta A$ and $\delta b$ are small. This is an example of backward error analysis, and
$\delta A$ and $\delta b$ are called the backward errors.


2. Apply the perturbation theory of section 2.2 to bound the error, for 
example by using bound (2.3) or (2.5). 


We have two goals in this section. The first is to show how to implement
Gaussian elimination in order to keep the backward errors $\delta A$ and $\delta b$ small.
In particular, we would like to keep $\frac{\|\delta A\|}{\|A\|}$ and $\frac{\|\delta b\|}{\|b\|}$ as small as $O(\epsilon)$. This is as
small as we can expect to make them, since merely rounding the largest entries
of $A$ (or $b$) to fit into the floating point format can make $\frac{\|\delta A\|}{\|A\|} \approx \epsilon$ (or $\frac{\|\delta b\|}{\|b\|} \approx \epsilon$). It
turns out that unless we are careful about pivoting, $\delta A$ and $\delta b$ need not be
small. We discuss this in the next section.

The second goal is to derive practical error bounds which are simultaneously
cheap to compute and “tight,” i.e., close to the true errors. It turns out that
the best bounds for $\|\delta A\|$ that we can formally prove are generally much larger
than the errors encountered in practice. Therefore, our practical error bounds
(in section 2.4.4) will rely on the computed residual $r = A\hat{x} - b$ and bound (2.5),
instead of bound (2.3). We also need to be able to estimate $\kappa(A)$ inexpensively;
this is discussed in section 2.4.3.

Unfortunately, we do not have error bounds that always satisfy our twin 
goals of cheapness and tightness, i.e., that simultaneously 


1. cost a negligible amount compared to solving $Ax = b$ in the first place
(for example, that cost $O(n^2)$ flops versus Gaussian elimination’s $O(n^3)$
flops), and


2. provide an error bound that is always at least as large as the true error 
and never more than a constant factor larger (100 times larger, say). 


The practical bounds in section 2.4.4 will cost O(n?) but will on very rare 
occasions provide error bounds that are much too small or much too large. 
The probability of getting a bad error bound is so small that these bounds are 
widely used in practice. The only truly guaranteed bounds use either interval 
arithmetic, very high precision arithmetic, or both and are several times more 
expensive than just solving Ax = b (see section 1.5). 

It has in fact been conjectured that no bound satisfying our twin goals of
cheapness and tightness exists, but this remains an open problem.


2.4.1. The Need for Pivoting 


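Let us apply LU factorization without pivoting to
$A = \begin{bmatrix} .0001 & 1 \\ 1 & 1 \end{bmatrix}$ in three
decimal-digit floating point arithmetic and see why we get the wrong answer.
Note that $\kappa(A) = \|A\|_\infty \cdot \|A^{-1}\|_\infty \approx 4$, so $A$ is well conditioned and thus we
should expect to be able to solve $Ax = b$ accurately.

\[
L = \begin{bmatrix} 1 & 0 \\ \mathrm{fl}(1/10^{-4}) & 1 \end{bmatrix},
\qquad \mathrm{fl}(1/10^{-4}) \text{ rounds to } 10^{4},
\]
\[
U = \begin{bmatrix} 10^{-4} & 1 \\ 0 & \mathrm{fl}(1 - 10^{4} \cdot 1) \end{bmatrix},
\qquad \mathrm{fl}(1 - 10^{4} \cdot 1) \text{ rounds to } -10^{4},
\]
\[
\text{so } LU = \begin{bmatrix} 1 & 0 \\ 10^{4} & 1 \end{bmatrix}
\begin{bmatrix} 10^{-4} & 1 \\ 0 & -10^{4} \end{bmatrix}
= \begin{bmatrix} 10^{-4} & 1 \\ 1 & 0 \end{bmatrix}
\quad \text{but} \quad
A = \begin{bmatrix} 10^{-4} & 1 \\ 1 & 1 \end{bmatrix}.
\]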


Note that the original $a_{22}$ has been entirely “lost” from the computation by
subtracting $10^4$ from it. We would have gotten the same LU factors whether
$a_{22}$ had been 1, 0, $-2$, or any number such that $\mathrm{fl}(a_{22} - 10^4) = -10^4$. Since
the algorithm proceeds to work only with $L$ and $U$, it will get the same answer
for all these different $a_{22}$, which correspond to completely different $A$ and
so completely different $x = A^{-1}b$; there is no way to guarantee an accurate
answer. This is called numerical instability, since $L$ and $U$ are not the exact
factors of a matrix close to $A$. (Another way to say this is that $\|A - LU\|$ is
about as large as $\|A\|$, rather than $\varepsilon\|A\|$.)

Let us see what happens when we go on to solve $Ax = [1, 2]^T$ for $x$ using
this LU factorization. The correct answer is $x \approx [1, 1]^T$. Instead we get the
following. Solving $Ly = [1, 2]^T$ yields $y_1 = \mathrm{fl}(1/1) = 1$ and
$y_2 = \mathrm{fl}(2 - 10^4 \cdot 1) = -10^4$; note that the value 2 has been “lost” by
subtracting $10^4$ from it. Solving $U\hat{x} = y$ yields
$\hat{x}_2 = \mathrm{fl}((-10^4)/(-10^4)) = 1$ and $\hat{x}_1 = \mathrm{fl}((1 - 1)/10^{-4}) = 0$, a
completely erroneous solution.

Another warning of the loss of accuracy comes from comparing the con- 
dition number of A to the condition numbers of L and U. Recall that we 
transform the problem of solving Ax = b into solving two other systems with 
L and U, so we do not want the condition numbers of L or U to be much larger 
than that of A. But here, the condition number of A is about 4, whereas the
condition numbers of L and U are about $10^8$.

In the next section we will show that doing GEPP nearly always eliminates 
the instability just illustrated. In the above example, GEPP would have re- 
versed the order of the two equations before proceeding. The reader is invited 
to confirm that in this case we would get 


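\[
L = \begin{bmatrix} 1 & 0 \\ \mathrm{fl}(.0001/1) & 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ .0001 & 1 \end{bmatrix}
\]

and

\[
U = \begin{bmatrix} 1 & 1 \\ 0 & \mathrm{fl}(1 - .0001 \cdot 1) \end{bmatrix}
= \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix},
\]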


so that LU approximates A quite accurately. Both L and U are quite well- 
conditioned, as is A. The computed solution vector is also quite accurate. 
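
The two computations above can be replayed in a few lines. The following is only a sketch: it uses Python's standard `decimal` module as a stand-in for three-decimal-digit floating point, and the function name `solve_2x2_no_pivot` is hypothetical. Calling it with the rows in the original order reproduces the erroneous solution; calling it with the rows exchanged (which is what GEPP would do) reproduces the accurate one.

```python
from decimal import Decimal, localcontext

def solve_2x2_no_pivot(A, b, digits=3):
    """LU-solve a 2x2 system without row exchanges, rounding every
    operation to `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        (a11, a12), (a21, a22) = [[Decimal(v) for v in row] for row in A]
        b1, b2 = (Decimal(v) for v in b)
        l21 = a21 / a11            # multiplier (the only nontrivial entry of L)
        u22 = a22 - l21 * a12      # a22 can be "lost" in this subtraction
        y1 = b1                    # forward substitution, L y = b
        y2 = b2 - l21 * y1
        x2 = y2 / u22              # back substitution, U x = y
        x1 = (y1 - a12 * x2) / a11
        return float(x1), float(x2)

# Rows in the original order: tiny pivot 0.0001.
no_pivot = solve_2x2_no_pivot([["0.0001", "1"], ["1", "1"]], ["1", "2"])
# Rows (and right-hand side) exchanged, as partial pivoting would do.
pivoted = solve_2x2_no_pivot([["1", "1"], ["0.0001", "1"]], ["2", "1"])

print(no_pivot, pivoted)
```

The rounding mode is the `decimal` default (round half even), which is an assumption of this sketch; the text does not specify one.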


2.4.2. Formal Error Analysis of Gaussian Elimination 


Here is the intuition behind our error analysis of LU decomposition. If
intermediate quantities arising in the product $L \cdot U$ are very large compared to
$\|A\|$, the information in entries of $A$ will get “lost” when these large values
are subtracted from them. This is what happened to $a_{22}$ in the example in
section 2.4.1. If the intermediate quantities in the product $L \cdot U$ were instead
comparable to those of $A$, we would expect a tiny backward error $A - LU$ in the
factorization. Therefore, we want to bound the largest intermediate quantities
in the product $L \cdot U$. We will do this by bounding the entries of the matrix
$|L| \cdot |U|$ (see section 1.1 for notation).

Our analysis is analogous to the one we used for polynomial evaluation
in section 1.6. There we considered $p = \sum_i a_i x^i$ and showed that if $|p|$ were
comparable to the sum of absolute values $\sum_i |a_i x^i|$, then $p$ would be computed
accurately.

After presenting a general analysis of Gaussian elimination, we will use
it to show that GEPP (or, more expensively, GECP) will keep the entries of
$|L| \cdot |U|$ comparable to $\|A\|$ in almost all practical circumstances.
Unfortunately, the best bounds on $\|\delta A\|$ that we can prove in general are
still much larger than the errors encountered in practice. Therefore, the error
bounds that we use in practice will be based on the computed residual $r$ and
bound (2.5) (or bound (2.9)) instead of the rigorous but pessimistic bound in
this section.

Now suppose that matrix $A$ has already been pivoted, so the notation is
simpler. We simplify Algorithm 2.2 to two equations, one for $a_{jk}$ with $j \le k$
and one for $j > k$. Let us first trace what Algorithm 2.2 does to $a_{jk}$ when
$j \le k$: this element is repeatedly updated by subtracting $l_{ji}u_{ik}$ for $i = 1$ to
$j - 1$ and is finally assigned to $u_{jk}$, so that

\[
u_{jk} = a_{jk} - \sum_{i=1}^{j-1} l_{ji} u_{ik}.
\]

When $j > k$, $a_{jk}$ again has $l_{ji}u_{ik}$ subtracted for $i = 1$ to $k - 1$, and then the
resulting sum is divided by $u_{kk}$ and assigned to $l_{jk}$:

\[
l_{jk} = \frac{a_{jk} - \sum_{i=1}^{k-1} l_{ji} u_{ik}}{u_{kk}}.
\]


To do the roundoff error analysis of these two formulas, we use the result
from Question 1.10 that a dot product computed in floating point arithmetic
satisfies

\[
\mathrm{fl}\Bigl( \sum_{i=1}^{d} x_i y_i \Bigr) = \sum_{i=1}^{d} x_i y_i (1 + \delta_i)
\quad \text{with } |\delta_i| \le d\varepsilon.
\]


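We apply this to the formula for $u_{jk}$, yielding$^8$

\[
u_{jk} = \Bigl( a_{jk} - \sum_{i=1}^{j-1} l_{ji} u_{ik} (1 + \delta_i) \Bigr) (1 + \delta')
\]

with $|\delta_i| \le (j - 1)\varepsilon$ and $|\delta'| \le \varepsilon$. Solving for $a_{jk}$ we get

\[
\begin{aligned}
a_{jk} &= \frac{1}{1 + \delta'} \, u_{jk} \cdot l_{jj} + \sum_{i=1}^{j-1} l_{ji} u_{ik} (1 + \delta_i)
  \qquad \text{since } l_{jj} = 1 \\
&= \sum_{i=1}^{j} l_{ji} u_{ik} + \sum_{i=1}^{j} l_{ji} u_{ik} \delta_i
  \qquad \text{with } |\delta_i| \le (j - 1)\varepsilon \text{ and } 1 + \delta_j \equiv \frac{1}{1 + \delta'} \\
&\equiv \sum_{i=1}^{j} l_{ji} u_{ik} + E_{jk},
\end{aligned}
\]

where we can bound $E_{jk}$ by

\[
|E_{jk}| = \Bigl| \sum_{i=1}^{j} l_{ji} u_{ik} \delta_i \Bigr|
\le \sum_{i=1}^{j} |l_{ji}| \cdot |u_{ik}| \cdot n\varepsilon
= n\varepsilon \, (|L| \cdot |U|)_{jk}.
\]

$^8$ Strictly speaking, the next formula assumes we compute the sum first and then subtract
from $a_{jk}$. But the final bound does not depend on the order of summation.

Doing the same analysis for the formula for $l_{jk}$ yields

\[
l_{jk} = (1 + \delta'') \left( \frac{(1 + \delta') \bigl( a_{jk} - \sum_{i=1}^{k-1} l_{ji} u_{ik} (1 + \delta_i) \bigr)}{u_{kk}} \right)
\]

with $|\delta_i| \le (k - 1)\varepsilon$, $|\delta'| \le \varepsilon$, and $|\delta''| \le \varepsilon$. We solve for $a_{jk}$ to get

\[
\begin{aligned}
a_{jk} &= \frac{1}{(1 + \delta')(1 + \delta'')} \, u_{kk} l_{jk} + \sum_{i=1}^{k-1} l_{ji} u_{ik} (1 + \delta_i) \\
&= \sum_{i=1}^{k} l_{ji} u_{ik} + \sum_{i=1}^{k} l_{ji} u_{ik} \delta_i
  \qquad \text{with } 1 + \delta_k \equiv \frac{1}{(1 + \delta')(1 + \delta'')} \\
&\equiv \sum_{i=1}^{k} l_{ji} u_{ik} + E_{jk}
\end{aligned}
\]

with $|\delta_i| \le n\varepsilon$, and so $|E_{jk}| \le n\varepsilon (|L| \cdot |U|)_{jk}$ as before.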

Altogether, we can summarize this error analysis with the simple formula
$A = LU + E$ where $|E| \le n\varepsilon |L| \cdot |U|$. Taking norms we get
$\|E\| \le n\varepsilon \, \| \, |L| \, \| \cdot \| \, |U| \, \|$. If the norm does not depend on the signs of the
matrix entries (true for the Frobenius, infinity-, and one-norms but not the
two-norm), we can simplify this to $\|E\| \le n\varepsilon \|L\| \cdot \|U\|$.
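
As a concrete check of the componentwise bound $|E| \le n\varepsilon |L| \cdot |U|$, the sketch below factors a small matrix in ordinary double precision using the two update formulas for $u_{jk}$ and $l_{jk}$, and then evaluates $E = A - LU$ exactly with rational arithmetic. The function names and the test matrix are illustrative choices, not taken from the text; $\varepsilon = 2^{-53}$ is assumed for IEEE double precision, and the matrix is diagonally dominant so that no pivoting is needed.

```python
from fractions import Fraction

def lu_no_pivot(A):
    """LU without pivoting, following the two update formulas:
       u_jk = a_jk - sum_{i<j} l_ji u_ik             for j <= k
       l_jk = (a_jk - sum_{i<k} l_ji u_ik) / u_kk    for j >  k"""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            s = A[j][k]
            for i in range(min(j, k)):
                s -= L[j][i] * U[i][k]
            if j <= k:
                U[j][k] = s
            else:
                L[j][k] = s / U[k][k]
    return L, U

def bound_holds(A, eps=Fraction(1, 2**53)):
    """True if |A - LU| <= n*eps*(|L||U|) holds in every entry (exact test)."""
    n = len(A)
    L, U = lu_no_pivot(A)
    for j in range(n):
        for k in range(n):
            terms = [Fraction(L[j][i]) * Fraction(U[i][k]) for i in range(n)]
            E_jk = abs(Fraction(A[j][k]) - sum(terms))
            if E_jk > n * eps * sum(abs(t) for t in terms):
                return False
    return True

# Illustrative 4-by-4 matrix: Hilbert-like entries plus a dominant diagonal.
A = [[1.0 / (i + j + 1) + (4.0 if i == j else 0.0) for j in range(4)]
     for i in range(4)]
ok = bound_holds(A)
print(ok)
```

Converting each float to a `Fraction` is exact, so the comparison involves no additional rounding.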

Now we consider solving the rest of the problem: $LUx = b$ via $Ly = b$
and $Ux = y$. The result of Question 1.11 shows that solving $Ly = b$ by
forward substitution yields a computed solution $\hat{y}$ satisfying $(L + \delta L)\hat{y} = b$ with
$|\delta L| \le n\varepsilon |L|$. Similarly when solving $Ux = \hat{y}$ we get $\hat{x}$ satisfying
$(U + \delta U)\hat{x} = \hat{y}$ with $|\delta U| \le n\varepsilon |U|$.

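Combining these yields

\[
\begin{aligned}
b &= (L + \delta L)\hat{y} \\
  &= (L + \delta L)(U + \delta U)\hat{x} \\
  &= (LU + L\,\delta U + \delta L\,U + \delta L\,\delta U)\hat{x} \\
  &= (A - E + L\,\delta U + \delta L\,U + \delta L\,\delta U)\hat{x} \\
  &\equiv (A + \delta A)\hat{x} \qquad \text{where } \delta A = -E + L\,\delta U + \delta L\,U + \delta L\,\delta U.
\end{aligned}
\]

Now we combine our bounds on $E$, $\delta L$, and $\delta U$ and use the triangle inequality
to bound $\delta A$:

\[
\begin{aligned}
|\delta A| &= |-E + L\,\delta U + \delta L\,U + \delta L\,\delta U| \\
&\le |E| + |L\,\delta U| + |\delta L\,U| + |\delta L\,\delta U| \\
&\le |E| + |L| \cdot |\delta U| + |\delta L| \cdot |U| + |\delta L| \cdot |\delta U| \\
&\le n\varepsilon |L| \cdot |U| + n\varepsilon |L| \cdot |U| + n\varepsilon |L| \cdot |U| + n^2\varepsilon^2 |L| \cdot |U| \\
&\approx 3n\varepsilon |L| \cdot |U|.
\end{aligned}
\]

Taking norms and assuming $\| \, |X| \, \| = \|X\|$ (true as before for the Frobenius,
infinity-, and one-norms but not the two-norm) we get
$\|\delta A\| \le 3n\varepsilon \|L\| \cdot \|U\|$.

Thus, to see when Gaussian elimination is backward stable, we must ask
when $3n\varepsilon \|L\| \cdot \|U\| = O(\varepsilon)\|A\|$; then the $\|\delta A\|/\|A\|$ in the perturbation theory
bounds will be $O(\varepsilon)$ as we desire (note that $\delta b = 0$).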

The main empirical observation, justified by decades of experience, is that
GEPP almost always keeps $\|L\| \cdot \|U\| \approx \|A\|$. GEPP guarantees that each entry
of $L$ is bounded by 1 in absolute value, so we need consider only $\|U\|$. We define
the pivot growth factor for GEPP$^9$ as $g_{PP} = \|U\|_{\max}/\|A\|_{\max}$, where
$\|A\|_{\max} = \max_{ij} |a_{ij}|$, so stability is equivalent to $g_{PP}$ being small or growing
slowly as a function of $n$. In practice, $g_{PP}$ is almost always $n$ or less. The average
behavior seems to be $n^{2/3}$ or perhaps even just $n^{1/2}$ [240]. (See Figure 2.1.) This makes
GEPP the algorithm of choice for many problems. Unfortunately, there are
rare examples in which $g_{PP}$ can be as large as $2^{n-1}$.


PROPOSITION 2.1. GEPP guarantees that $g_{PP} \le 2^{n-1}$. This bound is attainable.


Proof. The first step of GEPP updates $\tilde{a}_{jk} = a_{jk} - l_{ji} \cdot u_{ik}$, where $|l_{ji}| \le 1$
and $|u_{ik}| = |a_{ik}| \le \max_{rs} |a_{rs}|$, so $|\tilde{a}_{jk}| \le 2 \cdot \max_{rs} |a_{rs}|$. So each of the $n - 1$
major steps of GEPP can double the size of the remaining matrix entries, and
we get $2^{n-1}$ as the overall bound. See the example in Question 2.14 to see that
this is attainable.
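
To see the doubling at work, here is a short sketch that runs GEPP on a standard worst-case matrix (ones on the diagonal and in the last column, minus ones below the diagonal) and reports $g_{PP}$. Whether this is the same matrix as the one in Question 2.14 is an assumption; the function names are hypothetical.

```python
def gepp_growth(A):
    """Run GEPP on a copy of A and return g_PP = max|u_ij| / max|a_ij|."""
    n = len(A)
    U = [row[:] for row in A]
    max_a = max(abs(v) for row in A for v in row)
    for k in range(n - 1):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        U[k], U[p] = U[p], U[k]
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]      # multiplier, |m| <= 1 by choice of pivot
            U[i][k] = 0.0
            for j in range(k + 1, n):
                U[i][j] -= m * U[k][j]
    return max(abs(v) for row in U for v in row) / max_a

def worst_case_matrix(n):
    """1 on the diagonal and in the last column, -1 below the diagonal."""
    return [[1.0 if (i == j or j == n - 1) else (-1.0 if i > j else 0.0)
             for j in range(n)] for i in range(n)]

growth = gepp_growth(worst_case_matrix(10))
print(growth)
```

Each elimination step adds the pivot row to every row below it, so the last column doubles at every step, which is exactly the mechanism used in the proof.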


Putting all these bounds together, we get

\[
\|\delta A\|_\infty \le 3 g_{PP} n^3 \varepsilon \|A\|_\infty, \qquad (2.11)
\]

since $\|L\|_\infty \le n$ and $\|U\|_\infty \le n g_{PP} \|A\|_\infty$. The factor $3 g_{PP} n^3$ in the bound
causes it to almost always greatly overestimate the true $\|\delta A\|$, even if $g_{PP} = 1$.
For example, if $\varepsilon = 10^{-7}$ and $n = 150$, a very modest sized matrix, then
$3n^3\varepsilon > 1$, meaning that all precision is potentially lost. Example 2.3 graphs
$3 g_{PP} n^3 \varepsilon$ along with the true backward error to show how it can be pessimistic;
$\|\delta A\|$ is usually $O(\varepsilon)\|A\|$, so we can say that GEPP is backward stable in
practice, even though we can construct examples where it fails. Section 2.4.4
presents practical error bounds for the computed solution of $Ax = b$ that are
much smaller than what we get from using $\|\delta A\|_\infty \le 3 g_{PP} n^3 \varepsilon \|A\|_\infty$.


It can be shown that GECP is even more stable than GEPP, with its pivot
growth $g_{CP}$ satisfying the worst-case bound [260, p. 213]

\[
g_{CP} = \frac{\max_{ij} |u_{ij}|}{\max_{ij} |a_{ij}|}
\le \sqrt{n \cdot 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)}}
\approx n^{1/2 + \log_e n / 4}.
\]

$^9$ This definition is slightly different from the usual one in the literature but essentially
equivalent [119, p. 115].

This upper bound is also much too large in practice. The average behavior of
$g_{CP}$ is $n^{1/2}$. It was an old open conjecture that $g_{CP} \le n$, but this was recently
disproved [97, 120]. It remains an open problem to find a good upper bound
for $g_{CP}$ (which is still widely suspected to be $O(n)$).

The extra $O(n^3)$ comparisons that GECP uses to find the pivots ($O(n^2)$
comparisons per step, versus $O(n)$ for GEPP) make GECP significantly slower
than GEPP, especially on high-performance machines that perform floating
point operations about as fast as comparisons. Therefore, using GECP is
seldom warranted (but see sections 2.4.4, 2.5.1, and 5.4.3).


EXAMPLE 2.3. Figures 2.1 and 2.2 illustrate these backward error bounds. For 
both figures, five random matrices A of each dimension were generated, with 
independent normally distributed entries, of mean 0 and standard deviation 
1. (Testing such random matrices can sometimes be misleading about the 
behavior on some real problems, but it is still informative.) For each matrix, 
a similarly random vector b was generated. Both GEPP and GECP were used 
to solve Ax = b. Figure 2.1 plots the pivot growth factors gpp and gcp. In both 


It also shows the true backward error, computed as described in Theorem 2.2.
Machine epsilon is indicated by a solid horizontal line at $\varepsilon = 2^{-53} \approx 1.1 \cdot 10^{-16}$.
Both bounds are indeed bounds on the true backward error but are too large
by several orders of magnitude. For the Matlab program that produced these
plots, see HOMEPAGE/Matlab/pivot.m. ◊


2.4.3. Estimating Condition Numbers 


To compute a practical error bound based on a bound like (2.5), we need
to estimate $\|A^{-1}\|$. This is also enough to estimate the condition number
$\kappa(A) = \|A^{-1}\| \cdot \|A\|$, since $\|A\|$ is easy to compute. One approach is to compute
$A^{-1}$ explicitly and compute its norm. However, this would cost $2n^3$, more than
the original $\frac{2}{3}n^3$ for Gaussian elimination. (Note that this implies that it is not
cheaper to solve $Ax = b$ by computing $A^{-1}$ and then multiplying it by $b$. This
is true even if one has many different $b$ vectors. See Question 2.2.) It is a fact
that most users will not bother to compute error bounds if they are expensive.

So instead of computing $A^{-1}$ we will devise a much cheaper algorithm to
estimate $\|A^{-1}\|$. Such an algorithm is called a condition estimator and should
have the following properties:


1. Given the $L$ and $U$ factors of $A$, it should cost $O(n^2)$, which for large
enough $n$ is negligible compared to the $n^3$ cost of GEPP.


2. It should provide an estimate which is almost always within a factor of 10
of $\|A^{-1}\|$. This is all one needs for an error bound which tells you about
how many decimal digits of accuracy that you have. (A factor-of-10 error
is one decimal digit.$^{10}$)


Fig. 2.1. Pivot growth on random matrices, o = $g_{PP}$, + = $g_{CP}$.

Fig. 2.2. Backward error in Gaussian elimination on random matrices (one panel
for partial pivoting, one for complete pivoting), x = $3n^3\varepsilon g$,
+ = $3n\varepsilon \| \, |L| \cdot |U| \, \|_\infty / \|A\|_\infty$, o = $\|A\hat{x} - b\|_\infty / (\|A\|_\infty \|\hat{x}\|_\infty)$.


There are a variety of such estimators available (see [144] for a survey). 
We choose to present one that is widely applicable to problems besides solving 
Ax = b, at the cost of being slightly slower than algorithms specialized for 
Ax = b (but it is still reasonably fast). Our estimator, like most others, is 
guaranteed to produce only a lower bound on $\|A^{-1}\|$, not an upper bound.
Empirically, it is almost always within a factor of 10, and usually 2 to 3, of
$\|A^{-1}\|$. For the matrices in Figures 2.1 and 2.2, where the condition numbers
varied from 10 to $10^5$, the estimator equaled the condition number to several
decimal places 83% of the time and was .43 times too small at worst. This is
more than accurate enough to estimate the number of correct decimal digits 
in the final answer. 

The algorithm estimates the one-norm $\|B\|_1$ of a matrix $B$, provided that
we can compute $Bx$ and $B^T y$ for arbitrary $x$ and $y$. We will apply the algorithm
to $B = A^{-1}$, so we need to compute $A^{-1}x$ and $A^{-T}y$, i.e., solve linear systems.
This costs just $O(n^2)$ given the LU factorization of $A$. The algorithm was
developed in [136, 144, 146], with the latest version in [145]. Recall that $\|B\|_1$
is defined by

\[
\|B\|_1 = \max_{x \ne 0} \frac{\|Bx\|_1}{\|x\|_1} = \max_j \sum_{i=1}^{n} |b_{ij}|.
\]

It is easy to show that the maximum over $x \ne 0$ is attained at
$x = e_{j_0} = [0, \ldots, 0, 1, 0, \ldots, 0]^T$ (the single nonzero entry is component $j_0$,
where $\max_j \sum_i |b_{ij}|$ occurs at $j = j_0$).


Searching over all $e_j$, $j = 1, \ldots, n$ means computing all columns of
$B = A^{-1}$; this is too expensive. Instead, since $\|B\|_1 = \max_{\|x\|_1 \le 1} \|Bx\|_1$, we can
use hill climbing or gradient ascent on $f(x) \equiv \|Bx\|_1$ inside the set $\|x\|_1 \le 1$.
$\|x\|_1 \le 1$ is clearly a convex set of vectors, and $f(x)$ is a convex function, since
$0 \le \alpha \le 1$ implies
$f(\alpha x + (1 - \alpha)y) = \|\alpha Bx + (1 - \alpha)By\|_1 \le \alpha\|Bx\|_1 + (1 - \alpha)\|By\|_1 = \alpha f(x) + (1 - \alpha) f(y)$.

Doing gradient ascent to maximize $f(x)$ means moving $x$ in the direction
of the gradient $\nabla f(x)$ (if it exists) as long as $f(x)$ increases. The convexity
of $f(x)$ means $f(y) \ge f(x) + \nabla f(x) \cdot (y - x)$ (if $\nabla f(x)$ exists). To compute
$\nabla f$ we assume all $\sum_j b_{ij} x_j \ne 0$ in $f(x) = \sum_i |\sum_j b_{ij} x_j|$ (this is almost always
true). Let $\zeta_i = \mathrm{sign}(\sum_j b_{ij} x_j)$, so $\zeta_i = \pm 1$ and $f(x) = \sum_i \sum_j \zeta_i b_{ij} x_j$. Then
$\partial f / \partial x_k = \sum_i \zeta_i b_{ik}$ and $\nabla f = \zeta^T B = (B^T \zeta)^T$.

In summary, to compute $\nabla f(x)$ takes three steps: $w = Bx$, $\zeta = \mathrm{sign}(w)$,
and $\nabla f = \zeta^T B$.


$^{10}$ As stated earlier, no one has ever found an estimator that approximates $\|A^{-1}\|$ with some
guaranteed accuracy and is simultaneously significantly cheaper than explicitly computing
$A^{-1}$. It has been conjectured that no such estimator exists, but this has not been
proven.


ALGORITHM 2.5. Hager’s condition estimator returns a lower bound $\|w\|_1$ on
$\|B\|_1$:

    choose any x such that ||x||_1 = 1      /* e.g. x_i = 1/n */
    repeat
        w = Bx, zeta = sign(w), z = B^T zeta    /* z^T = grad f */
        if ||z||_inf <= z^T x then
            return ||w||_1
        else
            x = e_j where |z_j| = ||z||_inf
        endif
    end repeat


THEOREM 2.6. 1. When $\|w\|_1$ is returned, $\|w\|_1 = \|Bx\|_1$ is a local maximum
of $\|Bx\|_1$.

2. Otherwise, $\|Be_j\|_1$ (at end of loop) $> \|Bx\|_1$ (at start), so the algorithm
has made progress in maximizing $f(x)$.


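Proof.

1. In this case, $\|z\|_\infty \le z^T x$. Near $x$, $f(x) = \|Bx\|_1 = \sum_i \sum_j \zeta_i b_{ij} x_j$ is
linear in $x$ so $f(y) = f(x) + \nabla f(x) \cdot (y - x) = f(x) + z^T (y - x)$, where
$z^T = \nabla f(x)$. To show $x$ is a local maximum we want $z^T (y - x) \le 0$ when
$\|y\|_1 = 1$. We compute

\[
\begin{aligned}
z^T (y - x) &= z^T y - z^T x = \sum_i z_i \cdot y_i - z^T x \le \sum_i |z_i| \cdot |y_i| - z^T x \\
&\le \|z\|_\infty \cdot \|y\|_1 - z^T x = \|z\|_\infty - z^T x \le 0 \quad \text{as desired.}
\end{aligned}
\]

2. In this case $\|z\|_\infty > z^T x$. Choose $\tilde{x} = e_j \cdot \mathrm{sign}(z_j)$, where $j$ is chosen so
that $|z_j| = \|z\|_\infty$. Then

\[
\begin{aligned}
f(\tilde{x}) &\ge f(x) + \nabla f \cdot (\tilde{x} - x) = f(x) + z^T (\tilde{x} - x) \\
&= f(x) + z^T \tilde{x} - z^T x = f(x) + |z_j| - z^T x > f(x),
\end{aligned}
\]

where the last inequality is true by construction.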
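
The iteration of Algorithm 2.5 is short enough to write out directly. The sketch below is an illustration under stated assumptions, not the LAPACK implementation: the matrix $B$ is passed explicitly as a list of rows (in the intended application one would instead supply routines that apply $A^{-1}$ and $A^{-T}$ via the LU factors), `sign(0)` is taken to be $+1$, and the iteration cap `max_iter` is a safeguard added here.

```python
def hager_one_norm(B, max_iter=10):
    """Lower bound on ||B||_1 by hill climbing on f(x) = ||Bx||_1."""
    m, n = len(B), len(B[0])
    x = [1.0 / n] * n                      # starting point with ||x||_1 = 1
    est = 0.0
    for _ in range(max_iter):
        w = [sum(bij * xj for bij, xj in zip(row, x)) for row in B]      # w = Bx
        zeta = [1.0 if wi >= 0 else -1.0 for wi in w]                    # sign(w)
        z = [sum(B[i][j] * zeta[i] for i in range(m)) for j in range(n)] # B^T zeta
        est = sum(abs(wi) for wi in w)                                   # ||w||_1
        z_inf = max(abs(zj) for zj in z)
        if z_inf <= sum(zj * xj for zj, xj in zip(z, x)):
            return est                     # local maximum reached
        j = max(range(n), key=lambda k: abs(z[k]))
        x = [1.0 if k == j else 0.0 for k in range(n)]                   # x = e_j
    return est                             # still a valid lower bound

def one_norm(B):
    """Exact one-norm (maximum absolute column sum), for comparison."""
    return max(sum(abs(row[j]) for row in B) for j in range(len(B[0])))

B1 = [[1.0, 2.0], [3.0, 4.0]]
B2 = [[1.0, -2.0], [-3.0, 4.0]]
est1, est2 = hager_one_norm(B1), hager_one_norm(B2)
print(est1, one_norm(B1), est2, one_norm(B2))
```

On these two small matrices the estimate coincides with the exact norm after one move to a coordinate vector; in general only the lower-bound property is guaranteed.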


Higham [145, 146] tested a slightly improved version of this algorithm
by trying many random matrices of sizes 10, 25, 50 and condition numbers
$\kappa = 10, 10^3, 10^6, 10^9$; in the worst case the computed $\kappa$ underestimated the
true $\kappa$ by a factor .44. The algorithm is available in LAPACK as subroutine
slacon. LAPACK routines like sgesvx call slacon internally and return the
estimated condition number. (They actually return the reciprocal of the
estimated condition number, to avoid overflow on exactly singular matrices.) A
different condition estimator is available in Matlab as rcond. The Matlab
routine cond computes the exact condition number $\|A^{-1}\|_2 \|A\|_2$, using algorithms
discussed in section 5.4; it is much more expensive than rcond.


Estimating the Relative Condition Number


We can also use the algorithm from the last section to estimate the relative
condition number $\kappa_{CR}(A) = \| \, |A^{-1}| \cdot |A| \, \|_\infty$ from bound (2.8) or to evaluate
the bound $\| \, |A^{-1}| \cdot |r| \, \|_\infty$ from (2.9). We can reduce both to the same problem,
that of estimating $\| \, |A^{-1}| \cdot g \, \|_\infty$, where $g$ is a vector of nonnegative entries. To
see why, let $e$ be the vector of all ones. From part 5 of Lemma 1.7, we see that
$\|X\|_\infty = \|Xe\|_\infty$ if the matrix $X$ has nonnegative entries. Then

\[
\| \, |A^{-1}| \cdot |A| \, \|_\infty = \| \, |A^{-1}| \cdot |A| e \, \|_\infty = \| \, |A^{-1}| \cdot g \, \|_\infty,
\quad \text{where } g = |A| e.
\]

Here is how we estimate $\| \, |A^{-1}| \cdot g \, \|_\infty$. Let $G = \mathrm{diag}(g_1, \ldots, g_n)$; then
$g = Ge$. Thus

\[
\| \, |A^{-1}| \cdot g \, \|_\infty = \| \, |A^{-1}| \cdot Ge \, \|_\infty = \| \, |A^{-1}| \cdot G \, \|_\infty
= \| \, |A^{-1} G| \, \|_\infty = \|A^{-1} G\|_\infty. \qquad (2.12)
\]

The last equality is true because $\|Y\|_\infty = \| \, |Y| \, \|_\infty$ for any matrix $Y$. Thus, it
suffices to estimate the infinity norm of the matrix $A^{-1}G$. We can do this by
applying Hager’s algorithm, Algorithm 2.5, to the matrix $(A^{-1}G)^T = GA^{-T}$,
to estimate $\|(A^{-1}G)^T\|_1 = \|A^{-1}G\|_\infty$ (see part 6 of Lemma 1.7). This requires
us to multiply by the matrix $GA^{-T}$ and its transpose $A^{-1}G$. Multiplying by
$G$ is easy since it is diagonal, and we multiply by $A^{-1}$ and $A^{-T}$ using the LU
factorization of $A$, as we did in the last section.
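
A small exact computation may help make identity (2.12) concrete. The sketch below uses rational arithmetic on a hypothetical 2-by-2 matrix and compares $\| \, |A^{-1}| \cdot |A| \, \|_\infty$ with $\|A^{-1}G\|_\infty$ for $g = |A|e$; the explicit inverse is formed only because the example is tiny, and all helper names are illustrative.

```python
from fractions import Fraction

def inverse_2x2(A):
    """Exact inverse of a 2-by-2 matrix of Fractions."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def norm_inf(M):
    """Infinity norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def entrywise_abs(M):
    return [[abs(v) for v in row] for row in M]

A = [[Fraction(1), Fraction(2)], [Fraction(3), Fraction(4)]]
A_inv = inverse_2x2(A)

# Left-hand side of the reduction: || |A^{-1}| |A| ||_inf.
kappa_cr = norm_inf(matmul(entrywise_abs(A_inv), entrywise_abs(A)))

# Right-hand side of (2.12): ||A^{-1} G||_inf with G = diag(|A| e).
g = [sum(row) for row in entrywise_abs(A)]
G = [[g[0], Fraction(0)], [Fraction(0), g[1]]]
via_G = norm_inf(matmul(A_inv, G))

print(kappa_cr, via_G)
```

Both quantities agree exactly here, which is what (2.12) predicts for any nonnegative vector $g$.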


2.4.4. Practical Error Bounds 


We present two practical error bounds for our approximate solution $\hat{x}$ of
$Ax = b$. For the first bound we use inequality (2.5) to get

\[
\text{error} = \frac{\|\hat{x} - x\|_\infty}{\|\hat{x}\|_\infty}
\le \|A^{-1}\|_\infty \cdot \frac{\|r\|_\infty}{\|\hat{x}\|_\infty}, \qquad (2.13)
\]

where $r = A\hat{x} - b$ is the residual. We estimate $\|A^{-1}\|_\infty$ by applying
Algorithm 2.5 to $B = A^{-T}$, estimating $\|B\|_1 = \|A^{-T}\|_1 = \|A^{-1}\|_\infty$ (see parts 5
and 6 of Lemma 1.7).

Our second error bound comes from the tighter inequality (2.9):

\[
\text{error} = \frac{\|\hat{x} - x\|_\infty}{\|\hat{x}\|_\infty}
\le \frac{\| \, |A^{-1}| \cdot |r| \, \|_\infty}{\|\hat{x}\|_\infty}. \qquad (2.14)
\]

We estimate $\| \, |A^{-1}| \cdot |r| \, \|_\infty$ using the algorithm based on equation (2.12).
Error bound (2.14) (modified as described below in the subsection “What can
go wrong”) is computed by LAPACK routines like sgesvx. The LAPACK
variable name for the error bound is FERR, for Forward ERRor.

Fig. 2.3. Error bound (2.13) plotted versus true error, o = GEPP, + = GECP.
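
To make the difference between the two bounds tangible, the following sketch evaluates both of them exactly (with rational arithmetic) for a hypothetical badly row-scaled 2-by-2 system and a deliberately perturbed "computed" solution. Everything here is an illustrative assumption: the matrix, the perturbation, and the use of an explicit inverse, which in practice would be replaced by the estimator of Algorithm 2.5.

```python
from fractions import Fraction as F

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vec_norm_inf(v):
    return max(abs(x) for x in v)

def mat_norm_inf(M):
    return max(sum(abs(m) for m in row) for row in M)

def error_bounds(A, b, x_hat):
    """Return (bound (2.13), bound (2.14)) for the approximate solution x_hat."""
    A_inv = inverse_2x2(A)
    r = [ri - bi for ri, bi in zip(matvec(A, x_hat), b)]      # r = A x_hat - b
    abs_inv = [[abs(m) for m in row] for row in A_inv]
    abs_r = [abs(ri) for ri in r]
    first = mat_norm_inf(A_inv) * vec_norm_inf(r) / vec_norm_inf(x_hat)
    second = vec_norm_inf(matvec(abs_inv, abs_r)) / vec_norm_inf(x_hat)
    return first, second

# A = D*B with D = diag(1e8, 1): the rows differ in scale by eight orders.
A = [[F(10**8), F(10**7)], [F(1, 10), F(1)]]
x = [F(1), F(1)]
b = matvec(A, x)
x_hat = [F(1) + F(1, 10**6), F(1) - F(1, 10**6)]   # perturbed "computed" solution

true_err = vec_norm_inf([xh - xi for xh, xi in zip(x_hat, x)]) / vec_norm_inf(x_hat)
b213, b214 = error_bounds(A, b, x_hat)
print(float(true_err), float(b213), float(b214))
```

For this particular example the second bound equals the true error, while the first is larger by many orders of magnitude, because the norm $\|r\|_\infty$ is dominated by the heavily scaled first row.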


EXAMPLE 2.4. We have computed the first error bound (2.13) and the true 
error for the same set of examples as in Figures 2.1 and 2.2, plotting the result 
in Figure 2.3. For each problem Ax = b solved with GEPP we plot a o at 
the point (true error, error bound), and for each problem Ax = b solved with 
GECP we plot a + at the point (true error, error bound). If the error bound 
were equal to the true error, the o or + would lie on the solid diagonal line. 
Since the error bound always exceeds the true error, the os and +s lie above this 
diagonal. When the error bound is less than 10 times larger than the true error, 
the o or + appears between the solid diagonal line and the first superdiagonal 
dashed line. When the error bound is between 10 and 100 times larger than 
the true error, the o or + appears between the first two superdiagonal dashed 
lines. Most error bounds are in this range, with a few error bounds as large 
as 1000 times the true error. Thus, our computed error bound underestimates 
the number of correct decimal digits in the answer by one or two and in rare 
cases by as much as three. The Matlab code for producing these graphs is the 
same as before, HOMEPAGE/Matlab/pivot.m. ◊


EXAMPLE 2.5. We present an example chosen to illustrate the difference between
the two error bounds (2.13) and (2.14). This example will also show
that GECP can sometimes be more accurate than GEPP. We choose a set of
badly scaled examples constructed as follows. Each test matrix is of the form
$A = DB$, with the dimension running from 5 to 100. $B$ is equal to an identity
matrix plus very small random offdiagonal entries, around $10^{-7}$, so it is
very well-conditioned. $D$ is a diagonal matrix with entries scaled geometrically
from 1 up to $10^{14}$. (In other words, $d_{i+1,i+1}/d_{i,i}$ is the same for all $i$.) The
$A$ matrices have condition numbers $\kappa(A) = \|A^{-1}\|_\infty \cdot \|A\|_\infty$ nearly equal to
$10^{14}$, which is very ill-conditioned, although their relative condition numbers
$\kappa_{CR}(A) = \| \, |A^{-1}| \cdot |A| \, \|_\infty = \| \, |B^{-1}| \cdot |B| \, \|_\infty$ are all nearly 1. As before,
machine precision is $\varepsilon = 2^{-53} \approx 10^{-16}$. The examples were computed using the
same Matlab code HOMEPAGE/Matlab/pivot.m.

The pivot growth factors $g_{PP}$ and $g_{CP}$ were never larger than about 1.33 for
any example, and the backward error from Theorem 2.2 never exceeded $10^{-15}$
in any case. Hager’s estimator was very accurate in all cases, returning the
true condition number $10^{14}$ to many decimal places.

Figure 2.4 plots the error bounds (2.13) and (2.14) for these examples, along
with the componentwise relative backward error, as given by the formula in
Theorem 2.3. The cluster of plus signs in the upper left corner of the top
left graph shows that while GECP computes the answer with a tiny error
near $10^{-15}$, the error bound (2.13) is usually closer to $10^{-2}$, which is very
pessimistic. This is because the condition number is $10^{14}$, and so unless the
backward error is much smaller than $\varepsilon \approx 10^{-16}$, which is unlikely, the error
bound will be close to $10^{-16} \cdot 10^{14} = 10^{-2}$. The cluster of circles in the middle
top of the same graph shows that GEPP gets a larger error of about $10^{-8}$,
while the error bound (2.13) is again usually near $10^{-2}$.

In contrast, the error bound (2.14) is nearly perfectly accurate, as illus- 
trated by the pluses and circles on the diagonal in the top right graph of 
Figure 2.4. This graph again illustrates that GECP is nearly perfectly accu- 
rate, whereas GEPP loses about half the accuracy. This difference in accuracy 
is explained by the bottom graph, which shows the componentwise relative 
backward error for GEPP and GECP. This graph makes it clear that GECP 
has nearly perfect backward error in the componentwise relative sense, so since 
the corresponding componentwise relative condition number is 1, the accuracy 
is perfect. GEPP on the other hand is not completely stable in this sense, 
losing from 5 to 10 decimal digits. 

In section 2.5 we show how to iteratively improve the computed solution $\hat{x}$.
One step of this method will make the solution computed by GEPP as accurate
as the solution from GECP. Since GECP is significantly more expensive than
GEPP in practice, it is very rarely used. ◊


What Can Go Wrong 


Unfortunately, as mentioned in the beginning of section 2.4, error bounds (2.13)
and (2.14) are not guaranteed to provide tight bounds in all cases when
implemented in practice. In this section we describe the (rare!) ways they can fail,
and the partial remedies used in practice.

Fig. 2.4. (a) plots the error bound (2.13) versus the true error. (b) plots the error
bound (2.14) versus the true error. (c) plots the componentwise relative backward
error from Theorem 2.3. In all panels, o = GEPP, + = GECP.

First, as described in section 2.4.3, the estimate of $\|A^{-1}\|$ from
Algorithm 2.5 (or similar algorithms) provides only a lower bound, although the
probability is very low that it is more than 10 times too small.

Second, there is a small but nonnegligible probability that roundoff in the
evaluation of $r = A\hat{x} - b$ might make $\|r\|$ artificially small, in fact zero, and
so also make our computed error bound too small. To take this possibility
into account, one can add a small quantity to $|r|$: from
Question 1.10 we know that the roundoff in evaluating $r$ is bounded by

\[
|(A\hat{x} - b) - \mathrm{fl}(A\hat{x} - b)| \le (n + 1)\varepsilon (|A| \cdot |\hat{x}| + |b|), \qquad (2.15)
\]

so we can replace $|r|$ with $|r| + (n + 1)\varepsilon(|A| \cdot |\hat{x}| + |b|)$ in bound (2.14) (this is
done in the LAPACK code sgesvx) or $\|r\|$ with $\|r\| + (n + 1)\varepsilon(\|A\| \cdot \|\hat{x}\| + \|b\|)$
in bound (2.13). The factor $n + 1$ is usually much too large and can be omitted
if desired.
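
The repair of $|r|$ can be sketched in a few lines. The function below is a hypothetical helper, not the LAPACK code: it evaluates $r = A\hat{x} - b$ in ordinary floating point and adds the componentwise roundoff allowance from (2.15), assuming $\varepsilon = 2^{-53}$ for IEEE double precision. The tiny example is chosen so that the computed residual is exactly zero.

```python
def padded_abs_residual(A, x_hat, b, eps=2.0 ** -53):
    """Componentwise |r| + (n+1)*eps*(|A||x_hat| + |b|), with r = A*x_hat - b
    evaluated in ordinary floating point."""
    n = len(A)
    padded = []
    for row, b_i in zip(A, b):
        r_i = sum(a * x for a, x in zip(row, x_hat)) - b_i
        scale = sum(abs(a) * abs(x) for a, x in zip(row, x_hat)) + abs(b_i)
        padded.append(abs(r_i) + (n + 1) * eps * scale)
    return padded

# With A = I and x_hat = b the computed residual is exactly zero,
# yet the padded residual stays strictly positive.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [3.0, 4.0]
padded = padded_abs_residual(A, b, b)
print(padded)
```

A zero residual would make both error bounds vanish; the padded version keeps them at the level of the roundoff that could have hidden a nonzero residual.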

Third, roundoff in performing Gaussian elimination on very ill-conditioned 
matrices can yield such inaccurate L and U that bound (2.14) is much too low. 


EXAMPLE 2.6. We present an example, discovered by W. Kahan, that illustrates
the difficulties in getting truly guaranteed error bounds. In this example
the matrix $A$ will be exactly singular. Therefore the computed error bound on
$\|\hat{x} - x\| / \|\hat{x}\|$ should be one or larger to indicate that no digits in the computed
solution are correct, since the true solution does not exist.

Roundoff error during Gaussian elimination will yield nonsingular but very
ill-conditioned factors $L$ and $U$. With this example, computing using Matlab
with IEEE double precision arithmetic, the computed residual $r$ turns out to
be exactly zero because of roundoff, so both error bounds (2.13) and (2.14)
return zero. If we repair bound (2.13) by adding $4\varepsilon(\|A\| \cdot \|\hat{x}\| + \|b\|)$, it will be
larger than 1 as desired.

Unfortunately our second, “tighter” error bound (2.14) is about $10^{-7}$,
erroneously indicating that seven digits of the computed solution are correct.

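Here is how the example is constructed. Let $\chi = 3/2^{29}$, $\zeta = 2^{14}$,

\[
A = \begin{bmatrix}
\chi \cdot \zeta & -\zeta & \zeta \\
\zeta^{-1} & \zeta^{-1} & 0 \\
\zeta^{-1} & -\chi \cdot \zeta^{-1} & \zeta^{-1}
\end{bmatrix}
\approx \begin{bmatrix}
9.1553 \cdot 10^{-5} & -1.6384 \cdot 10^{4} & 1.6384 \cdot 10^{4} \\
6.1035 \cdot 10^{-5} & 6.1035 \cdot 10^{-5} & 0 \\
6.1035 \cdot 10^{-5} & -3.4106 \cdot 10^{-13} & 6.1035 \cdot 10^{-5}
\end{bmatrix},
\]

and $b = A \cdot [1, 1 + \varepsilon, 1]^T$. $A$ can be computed without any roundoff error,
but $b$ has a bit of roundoff, which means that it is not exactly in the space
spanned by the columns of $A$, so $Ax = b$ has no solution. Performing Gaussian
elimination, we get

\[
L \approx \begin{bmatrix}
1 & 0 & 0 \\
.66666 & 1 & 0 \\
.66666 & 1.0000 & 1
\end{bmatrix}
\]

and

\[
U \approx \begin{bmatrix}
9.1553 \cdot 10^{-5} & -1.6384 \cdot 10^{4} & 1.6384 \cdot 10^{4} \\
0 & 1.0923 \cdot 10^{4} & -1.0923 \cdot 10^{4} \\
0 & 0 & 1.8190 \cdot 10^{-12}
\end{bmatrix},
\]

yielding a computed value of

\[
A^{-1} \approx \begin{bmatrix}
2.0480 \cdot 10^{3} & 5.4976 \cdot 10^{11} & -5.4976 \cdot 10^{11} \\
-2.0480 \cdot 10^{3} & -5.4976 \cdot 10^{11} & 5.4976 \cdot 10^{11} \\
-2.0480 \cdot 10^{3} & -5.4976 \cdot 10^{11} & 5.4976 \cdot 10^{11}
\end{bmatrix}.
\]

This means the computed value of $|A^{-1}| \cdot |A|$ has all entries approximately
equal to $6.7109 \cdot 10^{7}$, so $\kappa_{CR}(A)$ is computed to be $O(10^7)$. In other words, the
error bound indicates that about $16 - 7 = 9$ digits of the computed solution
are accurate, whereas none are.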
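
Two claims about this construction can be checked without any floating point error: that $A$ is exactly singular, and that its entries are exactly representable in IEEE double precision. The sketch below does so with rational arithmetic, taking $\chi = 3/2^{29}$ and $\zeta = 2^{14}$ (the values consistent with the decimal entries quoted for $A$); the helper name `det3` is hypothetical.

```python
from fractions import Fraction

chi = Fraction(3, 2**29)
zeta = Fraction(2**14)

A = [
    [chi * zeta, -zeta,        zeta],
    [1 / zeta,   1 / zeta,     Fraction(0)],
    [1 / zeta,   -chi / zeta,  1 / zeta],
]

def det3(M):
    """Determinant of a 3-by-3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

determinant = det3(A)

# Every entry is a dyadic rational of modest size, so converting to a
# double and back recovers it exactly.
exactly_representable = all(Fraction(float(v)) == v for row in A for v in row)

print(determinant, exactly_representable, float(A[0][0]), float(A[2][1]))
```

The exact determinant is zero, so any nonsingular $L$ and $U$ produced by Gaussian elimination in floating point are artifacts of roundoff.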

Barring large pivot growth, one can prove that bound (2.13) (with $\|r\|$
appropriately increased) cannot be made artificially small by the phenomenon
illustrated here.

Similarly, Kahan has found a family of n-by-n singular matrices, where
changing one tiny entry (about $2^{-n}$) to zero lowers $\kappa_{CR}(A)$ to O(n). ◊


2.5. Improving the Accuracy of a Solution


We have just seen that the error in solving $Ax = b$ may be as large as $\kappa(A)\varepsilon$.
If this error is too large, what can we do? One possibility is to rerun the entire
computation in higher precision, but this may be quite expensive in time and
space. Fortunately, as long as $\kappa(A)$ is not too large, there are much cheaper
methods available for getting a more accurate solution.

To solve any equation $f(x) = 0$, we can try to use Newton’s method to
improve an approximate solution $x_i$ to get $x_{i+1} = x_i - f(x_i)/f'(x_i)$. Applying this to
$f(x) = Ax - b$ yields one step of iterative refinement:

    r = A x_i - b
    solve A d = r for d
    x_{i+1} = x_i - d

If we could compute $r = Ax_i - b$ exactly and solve $Ad = r$ exactly, we
would be done in one step, which is what we expect from Newton applied to
a linear problem. Roundoff error prevents this immediate convergence. The
algorithm is interesting and of use precisely when $A$ is so ill-conditioned that
solving $Ad = r$ (and $Ax_0 = b$) is rather inaccurate.
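
The three-line iteration can be tried out on a moderately ill-conditioned system. The sketch below is an illustration under stated assumptions: the working precision is IEEE double, the residual is accumulated exactly with rational arithmetic as a stand-in for "higher precision", and the test matrix (a 6-by-6 Hilbert matrix rounded to double) and all function names are choices made here, not taken from the text. The LU factors are computed once and reused for every correction.

```python
from fractions import Fraction

def lu_factor(A):
    """GEPP; returns L (unit lower, below the diagonal) and U packed together,
    plus the row permutation."""
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = LU[i][k] / LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] = LU[i][j] - LU[i][k] * LU[k][j]
    return LU, perm

def lu_solve(LU, perm, b):
    n = len(LU)
    y = [b[p] for p in perm]
    for i in range(n):                      # forward substitution, L y = P b
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):            # back substitution, U x = y
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] = y[i] / LU[i][i]
    return y

def refine(A, b, steps=3):
    """Return the iterates x_0, x_1, ..., x_steps of iterative refinement."""
    LU, perm = lu_factor(A)
    x = lu_solve(LU, perm, b)
    iterates = [x]
    for _ in range(steps):
        # r = A x_i - b accumulated exactly, then rounded once to double.
        r = [float(sum(Fraction(a) * Fraction(xj) for a, xj in zip(row, x))
                   - Fraction(bi)) for row, bi in zip(A, b)]
        d = lu_solve(LU, perm, r)           # solve A d = r with the same factors
        x = [xi - di for xi, di in zip(x, d)]
        iterates.append(x)
    return iterates

n = 6
A = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]
b = [1.0] * n

# Reference solution of the floating point data, computed exactly.
A_exact = [[Fraction(a) for a in row] for row in A]
x_exact = lu_solve(*lu_factor(A_exact), [Fraction(bi) for bi in b])

def rel_err(x):
    num = max(abs(Fraction(xi) - xe) for xi, xe in zip(x, x_exact))
    return float(num / max(abs(xe) for xe in x_exact))

errors = [rel_err(x) for x in refine(A, b)]
print(errors)
```

Running the same elimination code on `Fraction` inputs gives the exact solution of the stored system, which is what allows the error of each iterate to be measured without further roundoff.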


THEOREM 2.7. Suppose that $r$ is computed in double precision and
$\kappa(A) \cdot \varepsilon < c \equiv \frac{1}{3n^3 g + 1} < 1$, where $n$ is the dimension of $A$ and $g$ is the pivot
growth factor. Then repeated iterative refinement converges with

\[
\frac{\|x_i - A^{-1}b\|_\infty}{\|A^{-1}b\|_\infty} = O(\varepsilon).
\]

Note that the condition number does not appear in the final error bound.
This means that we compute the answer accurately independent of the condition
number, provided that $\kappa(A)\varepsilon$ is sufficiently less than 1. (In practice, $c$ is
too conservative an upper bound, and the algorithm often succeeds even when
$\kappa(A)\varepsilon$ is greater than $c$.)


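Sketch of Proof. In order to keep the proof transparent, we will take only
the most important rounding errors into account. For brevity, we abbreviate
$\|\cdot\|_\infty$ by $\|\cdot\|$. Our goal is to show that

\[
\|x_{i+1} - x\| \le \frac{\kappa(A)\varepsilon}{c} \|x_i - x\| \equiv \zeta \|x_i - x\|.
\]

By assumption, $\zeta < 1$, so this inequality implies that the error $\|x_{i+1} - x\|$
decreases monotonically to zero. (In practice it will not decrease all the way
to zero because of rounding error in the assignment $x_{i+1} = x_i - d$, which we
are ignoring.)

We begin by estimating the error in the computed residual $r$. We get
$r = \mathrm{fl}(Ax_i - b) = Ax_i - b + f$, where by the result of Question 1.10
$|f| \le n\varepsilon^2(|A| \cdot |x_i| + |b|) + \varepsilon|Ax_i - b| \approx \varepsilon|Ax_i - b|$. The $\varepsilon^2$ term comes from the
double precision computation of $r$, and the $\varepsilon$ term comes from rounding the
double precision result back to single precision. Since $\varepsilon^2 \ll \varepsilon$, we will neglect
the $\varepsilon^2$ term in the bound on $|f|$.

Next we get $(A + \delta A)d = r$, where from bound (2.11) we know that
$\|\delta A\| \le \gamma \cdot \varepsilon \cdot \|A\|$, where $\gamma = 3n^3 g$, although this is usually much too large. As
mentioned earlier we simplify matters by assuming $x_{i+1} = x_i - d$ exactly.

Continuing to ignore all $\varepsilon^2$ terms, we get

\[
\begin{aligned}
d &= (A + \delta A)^{-1} r = (I + A^{-1}\delta A)^{-1} A^{-1} r \\
&= (I + A^{-1}\delta A)^{-1} A^{-1} (Ax_i - b + f) \\
&= (I + A^{-1}\delta A)^{-1} (x_i - x + A^{-1} f) \\
&\approx (I - A^{-1}\delta A)(x_i - x + A^{-1} f) \\
&\approx x_i - x - A^{-1}\delta A (x_i - x) + A^{-1} f.
\end{aligned}
\]

Therefore $x_{i+1} - x = x_i - d - x = A^{-1}\delta A(x_i - x) - A^{-1} f$ and so

\[
\begin{aligned}
\|x_{i+1} - x\| &\le \|A^{-1}\delta A (x_i - x)\| + \|A^{-1} f\| \\
&\le \|A^{-1}\| \cdot \|\delta A\| \cdot \|x_i - x\| + \|A^{-1}\| \cdot \varepsilon \cdot \|Ax_i - b\| \\
&= \|A^{-1}\| \cdot \|\delta A\| \cdot \|x_i - x\| + \|A^{-1}\| \cdot \varepsilon \cdot \|A(x_i - x)\| \\
&\le \|A^{-1}\| \cdot \gamma\varepsilon \cdot \|A\| \cdot \|x_i - x\| + \|A^{-1}\| \cdot \|A\| \cdot \varepsilon \cdot \|x_i - x\| \\
&= \|A^{-1}\| \cdot \|A\| \cdot \varepsilon \cdot (\gamma + 1) \cdot \|x_i - x\|,
\end{aligned}
\]

so if

\[
\zeta = \|A^{-1}\| \cdot \|A\| \cdot \varepsilon (\gamma + 1) = \kappa(A)\varepsilon / c < 1,
\]

then we have convergence.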
Iterative refinement (or other variations of Newton’s method) can be used 
to improve accuracy for many other problems of linear algebra as well. 
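
The refinement loop analyzed above can be tried out with a minimal sketch,
given here under stated assumptions rather than as the implementation used in
any library. Single precision is imitated by rounding every arithmetic result
through Python's struct module, the residual is accumulated in ordinary
(double precision) Python floats, and the helper names fl32, solve_single,
refine, and hilbert_single are hypothetical. For brevity the sketch repeats
the elimination for every solve; a practical code would factor A once and
reuse the LU factors.

```python
import struct

def fl32(v):
    """Round a double to the nearest IEEE single precision value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def solve_single(A, b):
    """Gaussian elimination with partial pivoting; every arithmetic result is
    rounded to single precision to imitate a single precision solver."""
    n = len(A)
    M = [[fl32(v) for v in row] + [fl32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))   # pivot row
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            mult = fl32(M[r][k] / M[k][k])
            for c in range(k, n + 1):
                M[r][c] = fl32(M[r][c] - fl32(mult * M[k][c]))
    x = [0.0] * n
    for r in reversed(range(n)):                            # back substitution
        s = M[r][n]
        for c in range(r + 1, n):
            s = fl32(s - fl32(M[r][c] * x[c]))
        x[r] = fl32(s / M[r][r])
    return x

def refine(A, b, steps):
    """Return the iterates x_0, x_1, ..., x_steps of iterative refinement."""
    x = solve_single(A, b)
    iterates = [x]
    for _ in range(steps):
        # residual r = A x_i - b, accumulated in double precision
        r = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
        d = solve_single(A, r)                              # (A + dA) d = r
        x = [fl32(xi - di) for xi, di in zip(x, d)]         # x_{i+1} = x_i - d
        iterates.append(x)
    return iterates

def hilbert_single(n):
    """n-by-n Hilbert matrix with entries rounded to single precision."""
    return [[fl32(1.0 / (i + j + 1)) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    A = hilbert_single(4)            # moderately ill-conditioned test matrix
    b = [sum(row) for row in A]      # so the solution is (nearly) all ones
    for i, x in enumerate(refine(A, b, 5)):
        print(i, max(abs(xi - 1.0) for xi in x))
```

The printed column is the maximum error of each iterate against the vector of
all ones; it should shrink until it reaches the level of single precision
roundoff, which is the behavior Theorem 2.7 describes.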


2.5.1. Single Precision Iterative Refinement 


This section may be skipped on a first reading. 

Sometimes double precision is not available to run iterative refinement. 
For example, if the input data is already in double precision, we would need to 
compute the residual r in quadruple precision, which may not be available. On 
some machines, like the Intel Pentium, double-extended precision is available, 
which provides 11 more bits of fraction than double precision (see section 1.5). 
This is not as accurate as quadruple precision (which would need at least
$2 \cdot 53 = 106$ fraction bits) but still improves the accuracy noticeably.

But if none of these options are available, one could still run iterative
refinement while computing the residual r in single precision (i.e., the same
precision as the input data). In this case, Theorem 2.7 does not hold
any more. On the other hand, the following theorem shows that under certain
technical assumptions, one step of iterative refinement in single precision is still
worth doing because it reduces the componentwise relative backward error as
defined in Theorem 2.3 to $O(\varepsilon)$. If the corresponding relative
condition number $\kappa_{CR}(A) = \| \, |A^{-1}| \cdot |A| \, \|_\infty$ from
section 2.2.1 is significantly smaller than the usual condition number
$\kappa(A) = \|A^{-1}\|_\infty \cdot \|A\|_\infty$, then the answer will also be
more accurate.


THEOREM 2.8. Suppose that r is computed in single precision and

\[ \|A^{-1}\|_\infty \cdot \|A\|_\infty \cdot \frac{\max_i (|A| \cdot |x|)_i}{\min_i (|A| \cdot |x|)_i} \cdot \varepsilon < 1. \]

Then one step of iterative refinement yields $x_1$ such that
$(A + \delta A)x_1 = b + \delta b$ with
$|\delta a_{ij}| = O(\varepsilon)|a_{ij}|$ and
$|\delta b_i| = O(\varepsilon)|b_i|$. In other words, the componentwise
relative backward error is as small as possible. For example, this means that
if A and b are sparse, then $\delta A$ and $\delta b$ have the same sparsity
structures as A and b, respectively.


For a proof, see [147] as well as [14, 223, 224, 225] for more details. 
Single precision iterative refinement and the error bound (2.14) are imple- 
mented in LAPACK routines like sgesvx. 


EXAMPLE 2.7. We consider the same matrices as in Example 2.5 and perform
one step of iterative refinement in the same precision as the rest of the
computation ($\varepsilon \approx 10^{-16}$). For these examples, the usual
condition number is $\kappa(A) \approx 10^{14}$, whereas
$\kappa_{CR}(A) \approx 1$, so we expect a large accuracy improvement.
Indeed, the componentwise relative error for GEPP is driven below $10^{-15}$,
and the corresponding error from (2.14) is driven below $10^{-15}$ as well.
The Matlab code for this example is HOMEPAGE/Matlab/pivot.m. ◇


2.5.2. Equilibration 


There is one more common technique for improving the error in solving a linear
system: equilibration. This refers to choosing an appropriate diagonal matrix
D and solving DAx = Db instead of Ax = b. D is chosen to try to make the
condition number of DA smaller than that of A. In Example 2.7 for instance,
choosing $d_{ii}$ to be the reciprocal of the two-norm of row i of A would
make DA nearly equal to the identity matrix, reducing its condition number
from $10^{14}$ to 1. It is possible to show that choosing D this way reduces
the condition number of DA to within a factor of $\sqrt{n}$ of its smallest
possible value for any diagonal D [242]. In practice we may also choose two
diagonal matrices $D_{row}$ and $D_{col}$ and solve
$(D_{row} A D_{col})\bar{x} = D_{row} b$, $x = D_{col}\bar{x}$.
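
As an illustration of the row scaling just described, here is a minimal
sketch, assuming a dense matrix stored as a list of rows. The function name
row_equilibrate is hypothetical, and the sketch makes no claim about how any
library routine chooses its scale factors; it only forms DA and Db with
$d_{ii}$ equal to the reciprocal of the two-norm of row i.

```python
import math

def row_equilibrate(A, b):
    """Return (DA, Db, d), where d[i] = 1 / (two-norm of row i of A)."""
    d = [1.0 / math.sqrt(sum(v * v for v in row)) for row in A]
    DA = [[di * v for v in row] for di, row in zip(d, A)]
    Db = [di * bi for di, bi in zip(d, b)]
    return DA, Db, d

if __name__ == "__main__":
    # A badly row-scaled system (made-up data for illustration only).
    A = [[1e10, 2e10], [3.0, 4.0]]
    b = [1e10, 1.0]
    DA, Db, d = row_equilibrate(A, b)
    for row in DA:
        print(row, math.sqrt(sum(v * v for v in row)))   # each row norm is about 1
```

Because D multiplies both sides of the system, solving DAx = Db gives the same
x as Ax = b in exact arithmetic; only the conditioning of the matrix handed to
the solver changes.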

The techniques of iterative refinement and equilibration are implemented
in LAPACK subroutines such as sgerfs and sgeequ, respectively. These are
in turn used by driver routines like sgesvx.


2.6. Blocking Algorithms for Higher Performance


At the end of section 2.3, we said that changing the order of the three nested
loops in the implementation of Gaussian elimination in Algorithm 2.2 could
change the execution speed by orders of magnitude, depending on the computer
and the problem being solved. In this section we will explore why this is the
case and describe some carefully written linear algebra software which takes
these matters into account. These implementations use so-called block
algorithms, because they operate on square or rectangular subblocks of
matrices in their innermost loops rather than on entire rows or columns. These
codes are available in public-domain software libraries such as LAPACK (in
Fortran, at NETLIB/lapack)¹¹ and ScaLAPACK (at NETLIB/scalapack). LAPACK (and
its versions in other languages) is suitable for PCs, workstations, vector
computers, and shared-memory parallel computers. These include the SUN
SPARCcenter 2000 [236], SGI Power Challenge [221], DEC AlphaServer 8400 [101],
and Cray C90/J90 [251, 252]. ScaLAPACK is suitable for distributed-memory
parallel computers, such as the IBM SP-2 [254], Intel Paragon [255], Cray T3
series [253], and networks of workstations [9]. These libraries are available on
NETLIB, including a comprehensive manual [10].


A more comprehensive discussion of algorithms for high performance
(especially parallel) machines may be found on the World Wide Web at
PARALLEL_HOMEPAGE.


LAPACK was originally motivated by the poor performance of its prede- 
cessors LINPACK and EISPACK (also available on NETLIB) on some high- 
performance machines. For example, consider the table below, which presents 
the speed in Mflops of LINPACK’s Cholesky routine spofa on a Cray YMP, a 
supercomputer of the late 1980s. Cholesky is a variant of Gaussian elimination 
suitable for symmetric positive definite matrices. It is discussed in depth in 
section 2.7; here it suffices to know that it is very similar to Algorithm 2.2. The 
table also includes the speed of several other linear algebra operations. The 
Cray YMP is a parallel computer with up to 8 processors that can be used 
simultaneously, so we include one column of data for 1 processor and another 
column where all 8 processors are used. 


¹¹ A C translation of LAPACK, called CLAPACK (at NETLIB/clapack), is also available.
LAPACK++ (at NETLIB/c++/lapack++) and LAPACK90 (at NETLIB/lapack90) are
C++ and Fortran 90 interfaces to LAPACK, respectively.


Speed in Mflops                       1 Proc.    8 Procs.

Maximum speed                           330        2640
Matrix-matrix multiply (n = 500)        312        2425
Matrix-vector multiply (n = 500)        311        2285
Solve TX = B (n = 500)                  309        2398
Solve Tx = b (n = 500)                  272         584
LINPACK (Cholesky, n = 500)              72          72
LAPACK (Cholesky, n = 500)              290        1414
LAPACK (Cholesky, n = 1000)             301        2115


The top line, the maximum speed of the machine, is an upper bound on 
the numbers that follow. The basic linear algebra operations on the next four 
lines have been measured using subroutines especially designed for high speed 
on the Cray YMP. They all get reasonably close to the maximum possible
speed, except for solving Tx = b, a single triangular system of linear equations,
which does not use 8 processors effectively. Solving TX = B refers to solving
triangular systems with many right-hand sides (B is a square matrix). These 
numbers are for large matrices and vectors (n = 500). 

The Cholesky routine from LINPACK in the sixth line of the table executes 
significantly more slowly than these other operations, even though it is working 
on as large a matrix as the previous operations and doing mathematically sim- 
ilar operations. This poor performance leads us to try to reorganize Cholesky 
and other linear algebra routines to go as fast as their simpler counterparts 
like matrix-matrix multiplication. The speeds of these reorganized codes from 
LAPACK are given in the last two lines of the table. It is apparent that the 
LAPACK routines come much closer to the maximum speed of the machine. 
We emphasize that the LAPACK and LINPACK Cholesky routines perform
the same floating point operations, but in a different order.

To understand how these speedups were attained, we must understand how 
the time is spent by the computer while executing. This in turn requires us to 
understand how computer memories operate. It turns out that all computer 
memories, from the cheapest personal computer to the biggest supercomputer, 
are built as hierarchies, with a series of different kinds of memories ranging from 
very fast, expensive, and therefore small memory at the top of the hierarchy 
down to slow, cheap, and very large memory at the bottom. 


    Fast, small, expensive      Registers
                                Cache
                                Memory
                                Disk
    Slow, large, cheap          Tape


For example, registers form the fastest memory, then cache, main memory,
disks, and finally tape as the slowest, largest, and cheapest. Useful arithmetic
and logical operations can be done only on data at the top of the hierarchy, in
the registers. Data at one level of the memory hierarchy can move to adjacent
levels, for example, between main memory and disk. The speed at
which data moves is high near the top of the hierarchy (between registers and
cache) and slow near the bottom (between disk and main memory). In
particular, the speed at which arithmetic is done is much faster than the speed
at which data is transferred between lower levels in the memory hierarchy, by
factors of 10s or even 10000s, depending on the level. This means that an
ill-designed algorithm may spend most of its time moving data from the bottom
of the memory hierarchy to the registers in order to perform useful work rather
than actually doing the work.

Here is an example of a simple algorithm which unfortunately cannot avoid
spending most of its time moving data rather than doing useful arithmetic.
Suppose that we want to add two large n-by-n matrices, large enough so that
they fit only in a large, slow level of the memory hierarchy. To add them, they
must be transferred a piece at a time up to the registers to do the additions,
and the sums are transferred back down. Thus, there are exactly 3 memory
transfers between fast and slow memory (reading 2 summands into fast memory
and writing 1 sum back to slow memory) for every addition performed. If the
time to do a floating point operation is $t_{arith}$ seconds and the time to
move a word of data between memory levels is $t_{mem}$ seconds, where
$t_{mem} \gg t_{arith}$, then the execution time of this algorithm is
$n^2(t_{arith} + 3t_{mem})$, which is much larger than the time
$n^2 t_{arith}$ required for the arithmetic alone. This means that
matrix addition is doomed to run at the speed of the slowest level of memory
in which the matrices reside, rather than the much higher speed of addition.
In contrast, we will see later that other operations, such as matrix-matrix
multiplication, can be made to run at the speed of the fastest level of the
memory, even if the data are originally stored in the slowest.

LINPACK’s Cholesky routine runs so slowly because it was not designed
to minimize memory movement on machines such as the Cray YMP.¹² In
contrast, matrix-matrix multiplication and the three other basic linear algebra
algorithms measured in the table were specialized to minimize data movement
on a Cray YMP.

¹² It was designed to reduce another kind of memory movement, page faults
between main memory and disk.


2.6.1. Basic Linear Algebra Subroutines (BLAS)


Since it is not cost-effective to write a special version of every routine like
Cholesky for every new computer, we need a more systematic approach. Since
operations like matrix-matrix multiplication are so common, computer
manufacturers have standardized them as the Basic Linear Algebra Subroutines,
or BLAS [167, 87, 85], and optimized them for their machines. In other words,
a library of subroutines for matrix-matrix multiplication, matrix-vector
multiplication, and other similar operations is available with a standard
Fortran or C interface on high performance machines (and many others), but
underneath they have been optimized for each machine. Our goal is to take
advantage of these optimized BLAS by reorganizing algorithms like Cholesky so
that they call the BLAS to perform most of their work.

In this section we will discuss the BLAS in general. In section 2.6.2, we 
will describe how to optimize matrix multiplication in particular. Finally, in 
section 2.6.3, we show how to reorganize Gaussian elimination so that most of 
its work is performed using matrix multiplication. 

Let us examine the BLAS more carefully. Table 2.1 counts the number of
memory references and floating point operations performed by three related
BLAS. For example, the number of memory references needed to implement
the saxpy operation in line 1 of the table is 3n + 1, because we need to read
n values of $x_i$, n values of $y_i$, and 1 value of $\alpha$ from slow memory
to registers, and then write n values of $y_i$ back to slow memory. The last
column gives the ratio q of flops to memory references (its highest-order term
in n only).

The significance of q is that it tells us roughly how many flops we can
perform per memory reference or how much useful work we can do compared to
the time moving data. This tells us how fast the algorithm can potentially run.
For example, suppose that an algorithm performs f floating point operations,
each of which takes $t_{arith}$ seconds, and m memory references, each of which
takes $t_{mem}$ seconds. Then the total running time is as large as

\[ f \cdot t_{arith} + m \cdot t_{mem}
   = f \cdot t_{arith} \cdot \left( 1 + \frac{m}{f} \frac{t_{mem}}{t_{arith}} \right)
   = f \cdot t_{arith} \cdot \left( 1 + \frac{1}{q} \frac{t_{mem}}{t_{arith}} \right), \]

assuming that the arithmetic and memory references are not performed in
parallel. Therefore, the larger the value of q, the closer the running time is to
the best possible running time $f \cdot t_{arith}$, which is how long the
algorithm would take if all data were in registers. This means that algorithms
with the larger q values are better building blocks for other algorithms.
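
To make the role of q concrete, the following sketch evaluates the running
time model above using the operation counts of Table 2.1. It is only an
illustration: the function names and the machine parameters (memory references
taken to be 10 times slower than flops) are assumptions chosen for the example,
not measurements of any machine.

```python
from fractions import Fraction

def blas_counts(n):
    """Flop count f, memory references m, and q = f/m as listed in Table 2.1."""
    table = {
        "saxpy (BLAS1)": (2 * n, 3 * n + 1),
        "matrix-vector (BLAS2)": (2 * n**2, n**2 + 3 * n),
        "matrix-matrix (BLAS3)": (2 * n**3, 4 * n**2),
    }
    return {name: (f, m, Fraction(f, m)) for name, (f, m) in table.items()}

def running_time(f, m, t_arith, t_mem):
    """Total time f*t_arith + m*t_mem (arithmetic and memory traffic not overlapped)."""
    return f * t_arith + m * t_mem

def slowdown(q, t_arith, t_mem):
    """Factor by which the running time exceeds the ideal time f*t_arith."""
    return 1 + (t_mem / t_arith) / q

if __name__ == "__main__":
    n, t_arith, t_mem = 500, 1.0, 10.0      # hypothetical machine parameters
    for name, (f, m, q) in blas_counts(n).items():
        print(f"{name:24s} q = {float(q):8.3f}  slowdown = {slowdown(q, t_arith, t_mem):7.3f}")
```

The same two functions cover the matrix addition example given earlier: with
f = n^2 and m = 3n^2 we get q = 1/3, so the model predicts a running time of
n^2 (t_arith + 3 t_mem).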

Table 2.1 reflects a hierarchy of operations: Operations such as saxpy
perform $O(n^1)$ flops on vectors and offer the worst q values; these are called
Level 1 BLAS, or BLAS1 [167], and include inner products, multiplying a
scalar times a vector and other simple operations. Operations such as
matrix-vector multiplication perform $O(n^2)$ flops on matrices and vectors and
offer slightly better q values; these are called Level 2 BLAS, or BLAS2 [87, 86],
and include solving triangular systems of equations and rank-1 updates of
matrices ($A + xy^T$, x and y column vectors). Operations such as
matrix-matrix multiplication perform $O(n^3)$ flops on pairs of matrices, and
offer the best q values; these are called Level 3 BLAS, or BLAS3 [85, 84], and
include solving triangular systems of equations with many right-hand sides.


Operation            Definition                                     f      m          q = f/m

saxpy                y = alpha*x + y, or                            2n     3n + 1     2/3
(BLAS1)              y_i = alpha*x_i + y_i,
                     i = 1, ..., n

Matrix-vector mult   y = A*x + y, or                                2n^2   n^2 + 3n   2
(BLAS2)              y_i = sum_{j=1}^{n} a_ij x_j + y_i,
                     i = 1, ..., n

Matrix-matrix mult   C = A*B + C, or                                2n^3   4n^2       n/2
(BLAS3)              c_ij = sum_{k=1}^{n} a_ik b_kj + c_ij,
                     i, j = 1, ..., n


Table 2.1. Counting floating point operations and memory references for the BLAS. f
is the number of floating point operations, and m is the number of memory references.


The directory NETLIB/blas includes documentation and (unoptimized)
implementations of all the BLAS. For a quick summary of all the BLAS, see
NETLIB/blas/blasqr.ps. This summary also appears in [10, App. C] (or
NETLIB/lapack/lug/lapack_lug.html).

Since the Level 3 BLAS have the highest q values, we endeavor to reorganize 
our algorithms in terms of operations such as matrix-matrix multiplication 
rather than saxpy or matrix-vector multiplication. (LINPACK’s Cholesky is 
constructed in terms of calls to saxpy.) We emphasize that such reorganized 
algorithms will only be faster when using BLAS that have been optimized. 


2.6.2. How to Optimize Matrix Multiplication 


Let us examine in detail how to implement matrix multiplication C = A-B+C 
to minimize the number of memory moves and so optimize its performance. 
We will see that the performance is sensitive to the implementation details. To 
simplify our discussion, we will use the following machine model. We assume 
that matrices are stored columnwise, as in Fortran. (It is easy to modify the 
examples below if matrices are stored rowwise as in C.) We assume that there 
are two levels of memory hierarchy, fast and slow, where the slow memory
is large enough to contain the three n × n matrices A, B, and C, but the
fast memory contains only M words where $2n < M \ll n^2$; this means that
the fast memory is large enough to hold two matrix columns or rows but
not a whole matrix. We further assume that the data movement is under
programmer control. (In practice, data movement may be done automatically 
by hardware, such as the cache controller. Nonetheless, the basic optimization 
scheme remains the same.) 

The simplest matrix-multiplication algorithm that one might try consists of 
three nested loops, which we have annotated to indicate the data movements. 


ALGORITHM 2.6. Unblocked matrix multiplication (annotated to indicate
memory activity):

    for i = 1 to n
        { Read row i of A into fast memory }
        for j = 1 to n
            { Read C_ij into fast memory }
            { Read column j of B into fast memory }
            for k = 1 to n
                C_ij = C_ij + A_ik * B_kj
            end for
            { Write C_ij back to slow memory }
        end for
    end for


The innermost loop is doing a dot product of row i of A and column j of B to
compute $C_{ij}$, as shown in the following figure:


    [Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]


One can also describe the two innermost loops (on j and k) as doing a
vector-matrix multiplication of the ith row of A times the matrix B to get the
ith row of C. This is a hint that we will not perform any better than these
BLAS1 and BLAS2 operations, since they are within the innermost loops.


Here is the detailed count of memory references: $n^3$ for reading B n times
(once for each value of i); $n^2$ for reading A one row at a time and keeping
it in fast memory until it is no longer needed; and $2n^2$ for reading one
entry of C at a time, keeping it in fast memory until it is completely
computed, and then moving it back to slow memory. This comes to
$n^3 + 3n^2$ memory moves, or $q = 2n^3/(n^3 + 3n^2) \approx 2$, which is no
better than the Level 2 BLAS and far from the maximum possible n/2 (see
Table 2.1). If M < n, so that we cannot keep a full row of A in fast memory,
q further decreases to 1, since the algorithm reduces to a sequence of inner
products, which are Level 1 BLAS. For every permutation of the three loops on
i, j, and k, one gets another algorithm with q about the same.
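
The count above can be checked with a small sketch that follows the loop
structure of Algorithm 2.6 and tallies one memory reference for every word
moved between the (imagined) slow and fast memories. The function name
unblocked_matmul and the counter are hypothetical; Python lists stand in for
slow memory and local variables for fast memory.

```python
def unblocked_matmul(A, B, C):
    """Compute C = C + A*B in place (three nested loops, as in Algorithm 2.6)
    and return the number of words moved between slow and fast memory."""
    n = len(A)
    mem = 0
    for i in range(n):
        row = A[i][:]                          # read row i of A
        mem += n
        for j in range(n):
            cij = C[i][j]                      # read C_ij
            mem += 1
            col = [B[k][j] for k in range(n)]  # read column j of B
            mem += n
            for k in range(n):
                cij += row[k] * col[k]
            C[i][j] = cij                      # write C_ij back
            mem += 1
    return mem

if __name__ == "__main__":
    n = 50
    A = [[1.0] * n for _ in range(n)]
    B = [[1.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    mem = unblocked_matmul(A, B, C)
    print(mem, n**3 + 3 * n**2, 2 * n**3 / mem)   # count, formula, and q
```

The returned count equals $n^3 + 3n^2$ for every n, so the printed ratio q
approaches 2 as n grows.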


Our preferred algorithm uses blocking, where C is broken into an N × N
block matrix with n/N × n/N blocks $C^{ij}$, and A and B are similarly
partitioned, as shown below for N = 4. The algorithm becomes

\[ C^{ij} = C^{ij} + \sum_{k=1}^{N} A^{ik} B^{kj}. \]

    [Figure: block update of C^{ij} from block row i of A and block column j of B, N = 4]


ALGORITHM 2.7. Blocked matrix multiplication (annotated to indicate
memory activity):

    for i = 1 to N
        for j = 1 to N
            { Read C^{ij} into fast memory }
            for k = 1 to N
                { Read A^{ik} into fast memory }
                { Read B^{kj} into fast memory }
                C^{ij} = C^{ij} + A^{ik} * B^{kj}
            end for
            { Write C^{ij} back to slow memory }
        end for
    end for


Our memory reference count is as follows: $2n^2$ for reading and writing
each block of C once, $Nn^2$ for reading A N times ($N^3$ reads of
n/N-by-n/N submatrices $A^{ik}$, each submatrix being read N times), and
$Nn^2$ for reading B N times ($N^3$ reads of n/N-by-n/N submatrices
$B^{kj}$, each submatrix being read N times), for a total of
$(2N + 2)n^2 \approx 2Nn^2$ memory references. So we want to choose N as
small as possible to minimize the number of memory references. But N is
subject to the constraint $M \ge 3(n/N)^2$, which means that one block each
from A, B, and C must fit in fast memory simultaneously. This yields
$N \approx n\sqrt{3/M}$, and so
$q \approx (2n^3)/(2Nn^2) \approx \sqrt{M/3}$, which is much better than the
previous algorithm. In particular q grows independently of n as M grows,
which means that we expect the algorithm to be fast for any matrix size n and
to go faster if the fast memory size M is increased. These are both attractive
properties.

In fact, it can be shown that Algorithm 2.7 is asymptotically optimal [149].
In other words, no reorganization of matrix-matrix multiplication (that
performs the same $2n^3$ arithmetic operations) can have a q larger than
$O(\sqrt{M})$.
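
The blocked count can be verified in the same spirit. The sketch below follows
Algorithm 2.7 and charges one memory reference per word of each block that is
read or written; the name blocked_matmul is hypothetical, and the sketch
assumes for simplicity that N divides n.

```python
def blocked_matmul(A, B, C, N):
    """Compute C = C + A*B in place using an N-by-N grid of blocks (as in
    Algorithm 2.7) and return the number of words moved."""
    n = len(A)
    if n % N != 0:
        raise ValueError("this sketch assumes that N divides n")
    s = n // N                                 # each block is s-by-s
    mem = 0
    for i in range(N):
        for j in range(N):
            mem += s * s                       # read block C^{ij}
            for k in range(N):
                mem += 2 * s * s               # read blocks A^{ik} and B^{kj}
                for ii in range(i * s, (i + 1) * s):
                    for jj in range(j * s, (j + 1) * s):
                        acc = C[ii][jj]
                        for kk in range(k * s, (k + 1) * s):
                            acc += A[ii][kk] * B[kk][jj]
                        C[ii][jj] = acc
            mem += s * s                       # write block C^{ij} back
    return mem

if __name__ == "__main__":
    n, N = 48, 4
    A = [[1.0] * n for _ in range(n)]
    B = [[1.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    mem = blocked_matmul(A, B, C, N)
    print(mem, (2 * N + 2) * n * n, 2 * n**3 / mem)   # count, formula, and q
```

Here the fast memory must hold three s-by-s blocks at once, which is the
constraint $M \ge 3(n/N)^2$ of the text; making N smaller (larger blocks)
reduces the count and raises q.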

On the other hand, this brief analysis ignores a number of practical issues: 


1. A real code will have to deal with nonsquare matrices, for which the
optimal block sizes may not be square.


2. The cache and register structure of a machine will strongly affect the
best shapes of submatrices.


3. There may be special hardware instructions that perform both a multiply
and an addition in one cycle. It may also be possible to execute several
multiply-add operations simultaneously if they do not interfere.


For a detailed discussion of these issues for one high performance workstation,
the IBM RS6000/590, see [1], PARALLEL_HOMEPAGE, or http://www.austin.ibm.com/tech/.
Figure 2.5 shows the speeds of the three basic BLAS for this machine. The
horizontal axis is matrix size, and the vertical axis is speed in Mflops. The peak
machine speed is 266 Mflops. The top curve (peaking near 250 Mflops) is square
matrix-matrix multiplication. The middle curve (peaking near 100 Mflops) is
square matrix-vector multiplication, and the bottom curve (peaking near 75
Mflops) is saxpy. Note that the speed increases for larger matrices. This is a
common phenomenon and means that we will try to develop algorithms whose
internal matrix-multiplications use as large matrices as reasonable.


    [Figure: "RS2: Level 1, 2 and 3 BLAS"; speed in Mflops versus matrix size]

Fig. 2.5. BLAS speed on the IBM RS 6000/590.

Both the above matrix-matrix multiplication algorithms perform $2n^3$
arithmetic operations. It turns out that there are other implementations of
matrix-matrix multiplication that use far fewer operations. Strassen’s method
[3] was the first of these algorithms to be discovered and is the simplest to
explain. This algorithm multiplies matrices recursively by dividing them into
2 × 2 block matrices and multiplying the subblocks using seven matrix
multiplications (recursively) and 18 matrix additions of half the size; this
leads to an asymptotic complexity of $n^{\log_2 7} \approx n^{2.81}$ instead
of $n^3$.


ALGORITHM 2.8. Strassen’s matrix multiplication algorithm

    C = Strassen(A, B, n)
        /* Return C = A * B, where A and B are n-by-n;
           Assume n is a power of 2 */
        if n = 1
            return C = A * B    /* scalar multiplication */
        else
            Partition A = [ A11 A12 ; A21 A22 ] and B = [ B11 B12 ; B21 B22 ]
                where the subblocks Aij and Bij are n/2-by-n/2
            P1 = Strassen( A12 - A22, B21 + B22, n/2 )
            P2 = Strassen( A11 + A22, B11 + B22, n/2 )
            P3 = Strassen( A11 - A21, B11 + B12, n/2 )
            P4 = Strassen( A11 + A12, B22, n/2 )
            P5 = Strassen( A11, B12 - B22, n/2 )
            P6 = Strassen( A22, B21 - B11, n/2 )
            P7 = Strassen( A21 + A22, B11, n/2 )
            C11 = P1 + P2 - P4 + P6
            C12 = P4 + P5
            C21 = P6 + P7
            C22 = P2 - P3 + P5 - P7
            return C = [ C11 C12 ; C21 C22 ]
        end if
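
Algorithm 2.8 translates almost line for line into code. The following is a
sketch for illustration only, assuming square matrices stored as lists of rows
whose dimension is a power of 2; the helper names add, sub, and split are
hypothetical, and no attempt is made at the efficiency a practical
implementation would need (for example, switching to the standard algorithm
once the subproblems are small).

```python
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(X):
    """Return the four n/2-by-n/2 subblocks X11, X12, X21, X22."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen(A, B):
    """Return A*B for n-by-n matrices, n a power of 2 (as in Algorithm 2.8)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]           # scalar multiplication
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    P1 = strassen(sub(A12, A22), add(B21, B22))
    P2 = strassen(add(A11, A22), add(B11, B22))
    P3 = strassen(sub(A11, A21), add(B11, B12))
    P4 = strassen(add(A11, A12), B22)
    P5 = strassen(A11, sub(B12, B22))
    P6 = strassen(A22, sub(B21, B11))
    P7 = strassen(add(A21, A22), B11)
    C11 = add(sub(add(P1, P2), P4), P6)
    C12 = add(P4, P5)
    C21 = add(P6, P7)
    C22 = sub(add(sub(P2, P3), P5), P7)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]     # [ C11 C12 ]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]  # [ C21 C22 ]
    return top + bottom

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(strassen(A, B))
```

With integer entries the result agrees exactly with the standard product; in
floating point the two generally differ by rounding errors, in line with the
remark below on numerical stability.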


It is tedious but straightforward to confirm by induction that this algorithm
multiplies matrices correctly (see Question 2.21). To show that its complexity
is $O(n^{\log_2 7})$, we let T(n) be the number of additions, subtractions, and
multiplies performed by the algorithm. Since the algorithm performs seven
recursive calls on matrices of size n/2, and 18 additions of n/2-by-n/2
matrices, we can write down the recurrence $T(n) = 7T(n/2) + 18(n/2)^2$.
Changing variables from n to $m = \log_2 n$, we get a new recurrence
$\bar{T}(m) = 7\bar{T}(m-1) + 18(2^{m-1})^2$, where $\bar{T}(m) = T(2^m)$. We
can confirm that this linear recurrence for $\bar{T}$ has a solution
$\bar{T}(m) = O(7^m) = O(n^{\log_2 7})$.
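
The recurrence is easy to check numerically. The sketch below evaluates it
directly, assuming the base case T(1) = 1 (one scalar multiplication; the text
does not state the base case explicitly). Under that assumption one can verify
by substitution that $T(2^m) = 7^{m+1} - 6 \cdot 4^m$, which the code compares
against the recursion. The function names are hypothetical.

```python
def strassen_ops(n):
    """T(n) = 7 T(n/2) + 18 (n/2)^2 with assumed base case T(1) = 1;
    n must be a power of 2."""
    if n == 1:
        return 1
    return 7 * strassen_ops(n // 2) + 18 * (n // 2) ** 2

def closed_form(n):
    """7^(m+1) - 6*4^m for n = 2^m."""
    m = n.bit_length() - 1          # m = log2(n) when n is a power of 2
    return 7 ** (m + 1) - 6 * 4 ** m

if __name__ == "__main__":
    for n in (1, 2, 4, 64, 1024):
        # operation count of the recursion next to the 2n^3 of the standard algorithm
        print(n, strassen_ops(n), 2 * n ** 3)
```

Since $7^m = n^{\log_2 7}$ and $4^m = n^2$, the dominant term is
$7 n^{\log_2 7}$, consistent with the $O(n^{\log_2 7})$ bound.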

The value of Strassen’s algorithm is not just this asymptotic complexity
but its reduction of the problem to smaller subproblems which eventually fit
in fast memory; once the subproblems fit in fast memory, standard matrix
multiplication may be used. This approach has led to speedups on relatively
large matrices on some machines [22]. A drawback is the need for significant
workspace and somewhat lower numerical stability, although it is adequate for
many purposes [76]. There are a number of other even faster matrix
multiplication algorithms; the current record is about $O(n^{2.376})$, due to
Winograd and Coppersmith [261]. But these algorithms only perform fewer
operations than Strassen for impractically large values of n. For a survey
see [193].


2.6.3. Reorganizing Gaussian Elimination to use Level 3 BLAS 


We will reorganize Gaussian elimination to use, first, the Level 2 BLAS and,
then, the Level 3 BLAS. For simplicity, we assume that no pivoting is necessary.
Indeed, Algorithm 2.4 is already a Level 2 BLAS algorithm, because most
of the work is done in the second line,
A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n), which is a
rank-1 update of the submatrix A(i+1:n, i+1:n). The other arithmetic in the
algorithm, A(i+1:n, i) = A(i+1:n, i)/A(i, i), is actually done by multiplying
the vector A(i+1:n, i) by the scalar 1/A(i, i), since multiplication is much
faster than division; this is a Level 1 BLAS operation. We need to modify
Algorithm 2.4 slightly because we will use it within the Level 3 version:


ALGORITHM 2.9. Level 2 BLAS implementation of LU factorization without
pivoting for an m-by-n matrix A, where m ≥ n: Overwrite A by the m-by-n
matrix L and n-by-n matrix U. We have numbered the important lines for
later reference.

    for i = 1 to min(m - 1, n)
(1)     A(i+1:m, i) = A(i+1:m, i) / A(i, i)
        if i < n
(2)         A(i+1:m, i+1:n) = A(i+1:m, i+1:n) - A(i+1:m, i) * A(i, i+1:n)
        end if
    end for
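
For readers who want to experiment, here is a direct transcription of
Algorithm 2.9 into code. It is a sketch under the same assumption as the text
(no pivoting is needed, so no pivot A(i, i) is zero), with A stored as a list
of rows and indices shifted to start at 0; the name lu_level2 is hypothetical.

```python
def lu_level2(A):
    """Overwrite the m-by-n matrix A (m >= n) with its LU factors, without
    pivoting: the unit lower triangular L below the diagonal, U on and above."""
    m, n = len(A), len(A[0])
    for i in range(min(m - 1, n)):
        pivot = A[i][i]
        for r in range(i + 1, m):              # line (1): scale column i of L
            A[r][i] /= pivot
        for r in range(i + 1, m):              # line (2): rank-1 update of the
            l_ri = A[r][i]                     # trailing submatrix (the inner
            for c in range(i + 1, n):          # range is empty when i is the
                A[r][c] -= l_ri * A[i][c]      # last column)
    return A

if __name__ == "__main__":
    A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
    for row in lu_level2(A):
        print(row)
```

For this 3-by-3 example the factors are L = [1 0 0; 2 1 0; 4 3 1] and
U = [2 1 1; 0 1 1; 0 0 2], stored together in the overwritten A.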


The left side of Figure 2.6 illustrates Algorithm 2.9 applied to a square 
matrix. At step i of the algorithm, columns 1 to i — 1 of L and rows 1 to i— 1 
of U are already done, column i of L and row i of U are to be computed, and 
the trailing submatrix of A is to be updated by a rank-1 update. On the left 
side of Figure 2.6, the submatrices are labeled by the lines of the algorithm 
((1) or (2)) that update them. The rank-1 update in line (2) is to subtract the 
product of the shaded column and the shaded row from the submatrix labeled 
(2). 

    [Figure: left panel "Step i of Level 2 BLAS Implementation of LU";
     right panel "Step i of Level 3 BLAS Implementation of LU"]

Fig. 2.6. Level 2 and Level 3 BLAS implementations of LU factorization.


The Level 3 BLAS algorithm will reorganize this computation by delaying
the update of submatrix (2) for b steps, where b is a small integer called the
block size, and later applying b rank-1 updates all at once in a single
matrix-matrix multiplication. To see how to do this, suppose that we have
already computed the first i - 1 columns of L and rows of U, yielding (with
block rows and block columns of dimensions i - 1, b, and n - b - i + 1)

\[
A = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
  = \begin{bmatrix} L_{11} & 0 & 0 \\ L_{21} & I & 0 \\ L_{31} & 0 & I \end{bmatrix}
    \cdot
    \begin{bmatrix} U_{11} & U_{12} & U_{13} \\ 0 & \tilde{A}_{22} & \tilde{A}_{23} \\ 0 & \tilde{A}_{32} & \tilde{A}_{33} \end{bmatrix},
\]

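where all the matrices are partitioned the same way. This is shown on the
right side of Figure 2.6. Now apply Algorithm 2.9 to the submatrix
$\begin{bmatrix} \tilde{A}_{22} \\ \tilde{A}_{32} \end{bmatrix}$ to get

\[
\begin{bmatrix} \tilde{A}_{22} \\ \tilde{A}_{32} \end{bmatrix}
  = \begin{bmatrix} L_{22} \\ L_{32} \end{bmatrix} U_{22}
  = \begin{bmatrix} L_{22} U_{22} \\ L_{32} U_{22} \end{bmatrix}.
\]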

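This lets us write

\begin{align*}
\begin{bmatrix} \tilde{A}_{22} & \tilde{A}_{23} \\ \tilde{A}_{32} & \tilde{A}_{33} \end{bmatrix}
  &= \begin{bmatrix} L_{22} U_{22} & \tilde{A}_{23} \\ L_{32} U_{22} & \tilde{A}_{33} \end{bmatrix} \\
  &= \begin{bmatrix} L_{22} & 0 \\ L_{32} & I \end{bmatrix}
     \cdot
     \begin{bmatrix} U_{22} & L_{22}^{-1} \tilde{A}_{23} \\ 0 & \tilde{A}_{33} - L_{32} \cdot (L_{22}^{-1} \tilde{A}_{23}) \end{bmatrix} \\
  &\equiv \begin{bmatrix} L_{22} & 0 \\ L_{32} & I \end{bmatrix}
     \cdot
     \begin{bmatrix} U_{22} & U_{23} \\ 0 & \tilde{A}_{33} - L_{32} \cdot U_{23} \end{bmatrix} \\
  &\equiv \begin{bmatrix} L_{22} & 0 \\ L_{32} & I \end{bmatrix}
     \cdot
     \begin{bmatrix} U_{22} & U_{23} \\ 0 & \hat{A}_{33} \end{bmatrix}.
\end{align*}

Altogether, we get an updated factorization with b more columns of L and
rows of U completed:

\[
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
  = \begin{bmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & I \end{bmatrix}
    \cdot
    \begin{bmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & \hat{A}_{33} \end{bmatrix}.
\]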


This defines an algorithm with the following three steps, which are
illustrated on the right of Figure 2.6:


(1) Use Algorithm 2.9 to factorize
    $\begin{bmatrix} \tilde{A}_{22} \\ \tilde{A}_{32} \end{bmatrix}
      = \begin{bmatrix} L_{22} \\ L_{32} \end{bmatrix} U_{22}$.


(2) Form $U_{23} = L_{22}^{-1} \tilde{A}_{23}$. This means solving a triangular
    linear system with many right-hand sides ($\tilde{A}_{23}$), a single
    Level 3 BLAS operation.


(3) Form $\hat{A}_{33} = \tilde{A}_{33} - L_{32} \cdot U_{23}$, a matrix-matrix
    multiplication.

More formally, we have the following algorithm.


ALGORITHM 2.10. Level 3 BLAS implementation of LU factorization without
pivoting for an n-by-n matrix A. Overwrite L and U on A. The lines of
the algorithm are numbered as above and to correspond to the right part of
Figure 2.6.

    for i = 1 to n - 1 step b
(1)     Use Algorithm 2.9 to factorize A(i:n, i:i+b-1) = [ L22 ; L32 ] * U22
(2)     A(i:i+b-1, i+b:n) = L22^{-1} * A(i:i+b-1, i+b:n)
            /* form U23 */
(3)     A(i+b:n, i+b:n) = A(i+b:n, i+b:n)
                          - A(i+b:n, i:i+b-1) * A(i:i+b-1, i+b:n)
            /* form the updated A33 */
    end for
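
The three numbered steps can be sketched in code as follows. This is an
illustration of the data flow only, written with plain loops where an
optimized code would call Level 2 and Level 3 BLAS; the name lu_blocked is
hypothetical, indices start at 0, no pivoting is performed, and the last block
is simply truncated when b does not divide n.

```python
def lu_blocked(A, b):
    """Overwrite the n-by-n matrix A with its LU factors (no pivoting),
    processing b columns per step in the manner of Algorithm 2.10."""
    n = len(A)
    for i in range(0, n - 1, b):
        e = min(i + b, n)                      # block columns are i, ..., e-1
        # (1) factor the panel A[i:n, i:e] as in Algorithm 2.9
        for k in range(i, e):
            for r in range(k + 1, n):
                A[r][k] /= A[k][k]
                for c in range(k + 1, e):
                    A[r][c] -= A[r][k] * A[k][c]
        # (2) form U23 = L22^{-1} A[i:e, e:n] by forward substitution with the
        #     unit lower triangular L22 (triangular solve, many right-hand sides)
        for k in range(i, e):
            for r in range(k + 1, e):
                for c in range(e, n):
                    A[r][c] -= A[r][k] * A[k][c]
        # (3) update the trailing submatrix: A33 = A33 - L32 * U23
        for r in range(e, n):
            for c in range(e, n):
                s = 0.0
                for k in range(i, e):
                    s += A[r][k] * A[k][c]
                A[r][c] -= s
    return A

if __name__ == "__main__":
    A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
    for row in lu_blocked(A, 2):
        print(row)
```

In exact arithmetic the result does not depend on b; only the order of the
operations changes, which is the point of the reorganization.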


We still need to choose the block size b in order to maximize the speed of
the algorithm. On the one hand, we would like to make b large because we
have seen that speed increases when multiplying larger matrices. On the other
hand, we can verify that the number of floating point operations performed
by the slower Level 2 and Level 1 BLAS in line (1) of the algorithm is about
$n^2 b/2$ for small b, which grows as b grows, so we do not want to pick b too
large. The optimal value of b is machine dependent and can be tuned for each
machine. Values of b = 32 or b = 64 are commonly used.

To see detailed implementations of Algorithms 2.9 and 2.10, see subroutines
sgetf2 and sgetrf, respectively, in LAPACK (NETLIB/lapack). For
more information on block algorithms, including detailed performance
numbers on a variety of machines, see also [10] or the course notes at
PARALLEL_HOMEPAGE.


2.6.4. More About Parallelism and Other Performance Issues 


In this section we briefly survey other issues involved in implementing Gaussian 
elimination (and other linear algebra routines) as efficiently as possible. 

A parallel computer contains p > 1 processors capable of simultaneously 
working on the same problem. One may hope to solve any given problem 
p times faster on such a machine than on a conventional uniprocessor. But 
such “perfect efficiency” is rarely achieved, even if there are always at least 
p independent tasks available to do, because of the overhead of coordinating 
p processors and the cost of sending data from the processor that may store 
it to the processor that needs it. This last problem is another example of 
a memory hierarchy: from the point of view of processor i, its own memory 
is fast, but getting data from the memory owned by processor j is slower, 
sometimes thousands of times slower. 

Gaussian elimination offers many opportunities for parallelism, since each 
entry of the trailing submatrix may be updated independently and in parallel 
at each step. But some care is needed to be as efficient as possible. Two stan- 
dard pieces of software are available. The LAPACK routine sgetrf described 
in the last section [10] runs on shared-memory parallel machines, provided 
that one has available implementations of the BLAS that run in parallel. A 
related library called ScaLAPACK, for Scalable LAPACK [52], is designed for 
distributed-memory parallel machines, i.e., those that require special operations 
to move data between different processors. All software is available on NETLIB 
in the LAPACK and ScaLAPACK subdirectories. ScaLAPACK is described in 
more detail in the notes at PARALLEL HOMEPAGE. Extensive performance 
data for linear equation solvers are available as the LINPACK Benchmark [83], 
with an up-to-date version available at NETLIB/benchmark/performance.ps, 
or in the Performance Database Server (see footnote 13). As of August 1996, the fastest that
any linear system had been solved using Gaussian elimination was one with 
n = 128600 on an Intel Paragon XP/S MP with p = 6768 processors; the 
problem ran at just over 281 Gflops (gigaflops), of a maximum 338 Gflops. 
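As a small worked illustration of how far this record run is from "perfect efficiency," the two rates quoted above can be compared directly. This uses only the figures in the text; the per-processor rate is a derived number, not one reported by the benchmark.

```latex
\[
\frac{281\ \text{Gflops}}{338\ \text{Gflops}} \approx 0.83,
\qquad
\frac{281 \times 10^{3}\ \text{Mflops}}{6768\ \text{processors}} \approx 41.5\ \text{Mflops per processor}.
\]
```

So the run reached roughly 83% of the machine's quoted maximum rate.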

There are some matrices too large to fit in the main memory of any avail- 
able machine. These matrices are stored on disk and must be read into main 
memory piece by piece in order to perform Gaussian elimination. The orga- 
nization of such routines is largely similar to the technique currently used in 
ScaLAPACK, and they will soon be included in ScaLAPACK. 


13 http://performance.netlib.org/performance/html/PDStop.html

Finally, one might hope that compilers would become sufficiently clever to
take the simplest implementation of Gaussian elimination using three nested 
loops and automatically “optimize” the code to look like the blocked algorithm 
discussed in the last subsection. While there is much current research on this 
topic (see the bibliography in the recent compiler textbook [262]), there is 
still no reliably fast alternative to optimized libraries such as LAPACK and 
ScaLAPACK. 


2.7. Special Linear Systems 


As mentioned in section 1.2, it is important to exploit any special structure 
of the matrix to increase speed of solution and decrease storage. In practice, 
of course, the cost of the extra programming effort required to exploit this 
structure must be taken into account. For example, if our only goal is to 
minimize the time to get the desired solution, and it takes an extra week of 
programming effort to decrease the solution time from 10 seconds to 1 second, 
it is worth doing only if we are going to use the routine more than (1 week *
7 days/week * 24 hours/day * 3600 seconds/hour) / (10 seconds - 1 second)
= 67200 times. Fortunately, there are some special structures that turn up
frequently enough that standard solutions exist, and we should certainly use 
them. The ones we consider here are 


1. symmetric positive definite matrices, 

2. symmetric indefinite matrices, 

3. band matrices, 

4. general sparse matrices, 

5. dense matrices depending on fewer than $n^2$ independent parameters.


We will consider only real matrices; extensions to complex matrices are straight- 
forward. 


2.7.1. Real Symmetric Positive Definite Matrices 


Recall that a real matrix $A$ is s.p.d. if and only if $A = A^T$ and $x^T A x > 0$ for
all $x \neq 0$. In this section we will show how to solve $Ax = b$ in half the time
and half the space of Gaussian elimination when $A$ is s.p.d.


PROPOSITION 2.2. 1. If $X$ is nonsingular, then $A$ is s.p.d. if and only if
$X^T A X$ is s.p.d.

2. If $A$ is s.p.d. and $H$ is any principal submatrix of $A$ ($H = A(j:k, j:k)$
for some $j \le k$), then $H$ is s.p.d.

3. $A$ is s.p.d. if and only if $A = A^T$ and all its eigenvalues are positive.

4. If $A$ is s.p.d., then all $a_{ii} > 0$, and $\max_{ij} |a_{ij}| = \max_i a_{ii} > 0$.

5. $A$ is s.p.d. if and only if there is a unique lower triangular nonsingular
matrix $L$, with positive diagonal entries, such that $A = LL^T$. $A = LL^T$
is called the Cholesky factorization of $A$, and $L$ is called the Cholesky
factor of $A$.


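Proof.

1. $X$ nonsingular implies $Xx \neq 0$ for all $x \neq 0$, so $x^T X^T A X x > 0$ for all
$x \neq 0$. So $A$ s.p.d. implies $X^T A X$ is s.p.d. Use $X^{-1}$ to deduce the other
implication.

2. Suppose first that $H = A(1:m, 1:m)$. Then given any $m$-vector $y$, the
$n$-vector $x = [y^T, 0]^T$ satisfies $y^T H y = x^T A x$. So if $x^T A x > 0$ for all
nonzero $x$, then $y^T H y > 0$ for all nonzero $y$, and so $H$ is s.p.d. If $H$ does
not lie in the upper left corner of $A$, let $P$ be a permutation so that $H$
does lie in the upper left corner of $P^T A P$ and apply Part 1.

3. Let $X$ be the real, orthogonal eigenvector matrix of $A$ so that $X^T A X = \Lambda$
is the diagonal matrix of real eigenvalues $\lambda_i$. Since $x^T \Lambda x = \sum_i \lambda_i x_i^2$, $\Lambda$
is s.p.d. if and only if each $\lambda_i > 0$. Now apply Part 1.

4. Let $e_i$ be the $i$th column of the identity matrix. Then $e_i^T A e_i = a_{ii} > 0$
for all $i$. If $|a_{kl}| = \max_{ij} |a_{ij}|$ but $k \neq l$, choose $x = e_k - \mathrm{sign}(a_{kl}) e_l$.
Then $x^T A x = a_{kk} + a_{ll} - 2|a_{kl}| \le 0$, contradicting positive-definiteness.

5. Suppose $A = LL^T$ with $L$ nonsingular. Then $x^T A x = (x^T L)(L^T x) =
\|L^T x\|_2^2 > 0$ for all $x \neq 0$, so $A$ is s.p.d. If $A$ is s.p.d., we show that $L$
exists by induction on the dimension $n$. If we choose each $l_{ii} > 0$, our
construction will determine $L$ uniquely. If $n = 1$, choose $l_{11} = \sqrt{a_{11}}$,
which exists since $a_{11} > 0$. As with Gaussian elimination, it suffices to
understand the block 2-by-2 case. Write

$$
A = \begin{bmatrix} a_{11} & A_{12} \\ A_{12}^T & A_{22} \end{bmatrix}
= \begin{bmatrix} \sqrt{a_{11}} & 0 \\ \frac{A_{12}^T}{\sqrt{a_{11}}} & I \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \tilde{A}_{22} \end{bmatrix}
\begin{bmatrix} \sqrt{a_{11}} & \frac{A_{12}}{\sqrt{a_{11}}} \\ 0 & I \end{bmatrix}
= \begin{bmatrix} a_{11} & A_{12} \\ A_{12}^T & \tilde{A}_{22} + \frac{A_{12}^T A_{12}}{a_{11}} \end{bmatrix},
$$

so the $(n-1)$-by-$(n-1)$ matrix $\tilde{A}_{22} = A_{22} - \frac{A_{12}^T A_{12}}{a_{11}}$ is symmetric.
By Part 1 above, $\begin{bmatrix} 1 & 0 \\ 0 & \tilde{A}_{22} \end{bmatrix}$ is s.p.d., so by Part 2 $\tilde{A}_{22}$ is s.p.d.
Thus by induction there exists an $\tilde{L}$ such that $\tilde{A}_{22} = \tilde{L}\tilde{L}^T$ and

$$
A = \begin{bmatrix} \sqrt{a_{11}} & 0 \\ \frac{A_{12}^T}{\sqrt{a_{11}}} & I \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \tilde{L}\tilde{L}^T \end{bmatrix}
\begin{bmatrix} \sqrt{a_{11}} & \frac{A_{12}}{\sqrt{a_{11}}} \\ 0 & I \end{bmatrix}
= \begin{bmatrix} \sqrt{a_{11}} & 0 \\ \frac{A_{12}^T}{\sqrt{a_{11}}} & \tilde{L} \end{bmatrix}
\begin{bmatrix} \sqrt{a_{11}} & \frac{A_{12}}{\sqrt{a_{11}}} \\ 0 & \tilde{L}^T \end{bmatrix}
\equiv LL^T. \qquad \Box
$$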

We may rewrite this induction as the following algorithm. 
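ALGORITHM 2.11. Cholesky algorithm:


    for j = 1 to n
        $l_{jj} = \left( a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2 \right)^{1/2}$
        for i = j + 1 to n
            $l_{ij} = \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik} l_{jk} \right) / l_{jj}$
        end for
    end for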


If A is not positive definite, then (in exact arithmetic) this algorithm will 
fail by attempting to compute the square root of a negative number or by 
dividing by zero; this is the cheapest way to test if a symmetric matrix is 
positive definite. 
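The Cholesky algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the LAPACK implementation: the function name `cholesky`, the list-of-rows storage, and the choice to raise `ValueError` on a nonpositive pivot are assumptions made here for the example.

```python
import math

def cholesky(a):
    """Return the lower triangular factor L (list of rows) with A = L L^T.

    Only the lower triangle of `a` is read. A nonpositive pivot raises
    ValueError, which signals that A is not positive definite.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: l_jj = sqrt(a_jj - sum_{k<j} l_jk^2)
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite (pivot %d)" % j)
        l[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j
        for i in range(j + 1, n):
            t = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = t / l[j][j]
    return l

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L = cholesky(A)
```

Calling `cholesky` on a symmetric matrix and catching the exception is one way to realize the positive-definiteness test described above; in floating point arithmetic a matrix very close to singular may pass or fail because of roundoff.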

As with Gaussian elimination, $L$ can overwrite the lower half of $A$. Only
the lower half of $A$ is referred to by the algorithm, so in fact only $n(n + 1)/2$
storage is needed instead of $n^2$. The number of flops is

$$
\sum_{j=1}^{n} \Bigl( 2j + \sum_{i=j+1}^{n} 2j \Bigr) = \frac{1}{3} n^3 + O(n^2),
$$

or just half the flops of Gaussian elimination. Just as with Gaussian
elimination, Cholesky may be reorganized to perform most of its floating point
operations using Level 3 BLAS; see LAPACK routine spotrf.
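For readers who want the intermediate steps, the flop count can be summed in closed form. The following is a routine evaluation of the sum using the standard formulas for $\sum j$ and $\sum j^2$; it adds nothing beyond the stated result.

```latex
\begin{align*}
\sum_{j=1}^{n} \Bigl( 2j + \sum_{i=j+1}^{n} 2j \Bigr)
  &= \sum_{j=1}^{n} 2j + \sum_{j=1}^{n} 2j\,(n-j) \\
  &= n(n+1) + \frac{n(n+1)(n-1)}{3} \\
  &= \frac{n(n+1)(n+2)}{3}
   = \frac{1}{3}n^3 + n^2 + \frac{2}{3}n .
\end{align*}
```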

Pivoting is not necessary for Cholesky to be numerically stable (equivalently,
we could also say any pivot order is numerically stable). We show this
as follows. The same analysis as for Gaussian elimination in section 2.4.2 shows
that the computed solution $\hat{x}$ satisfies $(A + \delta A)\hat{x} = b$ with $|\delta A| \le 3n\varepsilon |L| \cdot |L^T|$.
But by the Cauchy-Schwarz inequality and Part 4 of Proposition 2.2

$$
(|L| \cdot |L^T|)_{ij} = \sum_k |l_{ik}| \cdot |l_{jk}|
\le \sqrt{\sum_k l_{ik}^2}\, \sqrt{\sum_k l_{jk}^2}
= \sqrt{a_{ii}} \cdot \sqrt{a_{jj}}
\le \max_{ij} |a_{ij}|, \qquad (2.16)
$$

so $\| |L| \cdot |L^T| \|_\infty \le n \|A\|_\infty$ and $\|\delta A\|_\infty \le 3n^2 \varepsilon \|A\|_\infty$.


2.7.2. Symmetric Indefinite Matrices 


The question of whether we can still save half the time and half the space
when solving a symmetric but indefinite (neither positive definite nor negative
definite) linear system naturally arises. It turns out to be possible, but a more
complicated pivoting scheme and factorization is required. If $A$ is nonsingular,
one can show that there exists a permutation $P$, a unit lower triangular matrix
$L$, and a block diagonal matrix $D$ with 1-by-1 and 2-by-2 blocks such that
$PAP^T = LDL^T$. To see why 2-by-2 blocks are needed in $D$, consider the
matrix $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$. This factorization can be computed stably, saving about half
the work and space compared to standard Gaussian elimination. The name of
the LAPACK subroutine which does this operation is ssysv. The algorithm
is described in [43].
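To make the need for 2-by-2 blocks concrete, here is a short worked check for the 2-by-2 example, assuming $D$ were diagonal with entries $d_1, d_2$ and $L$ unit lower triangular with the single subdiagonal entry $l_{21}$ (these symbols are introduced only for this illustration).

```latex
\[
\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ l_{21} & 1 \end{bmatrix}
  \begin{bmatrix} d_1 & 0 \\ 0 & d_2 \end{bmatrix}
  \begin{bmatrix} 1 & l_{21} \\ 0 & 1 \end{bmatrix}
= \begin{bmatrix} d_1 & d_1 l_{21} \\ d_1 l_{21} & d_1 l_{21}^2 + d_2 \end{bmatrix}
\]
% The (1,1) entries force d_1 = 0, but then the (2,1) entry d_1 l_{21} = 0 cannot equal 1.
% A symmetric permutation P A P^T only exchanges the two zero diagonal entries,
% so it reproduces the same matrix and cannot help.
```

With a single 2-by-2 block the factorization is trivial: take $P = I$, $L = I$, and $D$ equal to the matrix itself.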


2.7.3. Band Matrices 


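A matrix $A$ is called a band matrix with lower bandwidth $b_L$ and upper
bandwidth $b_U$ if $a_{ij} = 0$ whenever $i > j + b_L$ or $i < j - b_U$:

$$
A = \begin{bmatrix}
a_{11} & \cdots & a_{1,b_U+1} & & 0 \\
\vdots & & & a_{2,b_U+2} & \\
a_{b_L+1,1} & & & & \ddots \\
 & a_{b_L+2,2} & & & a_{n-b_U,n} \\
0 & & a_{n,n-b_L} & \cdots & a_{nn}
\end{bmatrix}.
$$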


Band matrices arise often in practice (we give an example later) and are 
useful to recognize because their L and U factors are also “essentially banded,” 
making them cheaper to compute and store. We explain what we mean by 
“essentially banded” below. But first, we consider LU factorization without 
pivoting and show that L and U are banded in the usual sense, with the same 
bandwidths as A. 


PROPOSITION 2.3. Let $A$ be banded with lower bandwidth $b_L$ and upper
bandwidth $b_U$. Let $A = LU$ be computed without pivoting. Then $L$ has lower
bandwidth $b_L$ and $U$ has upper bandwidth $b_U$. $L$ and $U$ can be computed in
about $2n \cdot b_U \cdot b_L$ arithmetic operations when $b_U$ and $b_L$ are small compared to
$n$. The space needed is $n(b_L + b_U + 1)$. The full cost of solving $Ax = b$ is
$2n b_U \cdot b_L + 2n b_U + 2n b_L$.


Sketch of Proof. It suffices to look at one step; see Figure 2.7. At step j of
Gaussian elimination, the shaded region is modified by subtracting the product
of the first column and first row of the shaded region; note that this does not
enlarge the bandwidth.

Fig. 2.7. Band LU factorization without pivoting.
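The claim that LU without pivoting preserves the bandwidths can be checked numerically on a small example. The sketch below uses a dense elimination loop purely for clarity (a real band solver would store and touch only the band); the helper names `lu_no_pivot` and `bandwidths` and the test matrix are assumptions made for this illustration.

```python
def lu_no_pivot(a):
    """Dense LU factorization without pivoting; returns (L, U) as lists of rows."""
    n = len(a)
    u = [row[:] for row in a]
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n - 1):
        for i in range(j + 1, n):
            l[i][j] = u[i][j] / u[j][j]
            u[i][j] = 0.0                      # eliminated entry
            for k in range(j + 1, n):
                u[i][k] -= l[i][j] * u[j][k]
    return l, u

def bandwidths(m):
    """Return (lower, upper) bandwidth of a square matrix with nonzero diagonal."""
    n = len(m)
    nz = [(i, j) for i in range(n) for j in range(n) if m[i][j] != 0.0]
    return max(i - j for i, j in nz), max(j - i for i, j in nz)

# Diagonally dominant band matrix with lower bandwidth 2 and upper bandwidth 1.
n, bl, bu = 8, 2, 1
A = [[10.0 if i == j else (1.0 if -bu <= i - j <= bl else 0.0)
      for j in range(n)] for i in range(n)]
L, U = lu_no_pivot(A)
band_L = bandwidths(L)
band_U = bandwidths(U)
```

Here `band_L` reports the lower bandwidth of L and `band_U` the upper bandwidth of U; for this matrix they agree with the bandwidths of A, as the proposition predicts.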


PROPOSITION 2.4. Let $A$ be banded with lower bandwidth $b_L$ and upper
bandwidth $b_U$. Then after Gaussian elimination with partial pivoting, $U$ is banded
with upper bandwidth at most $b_L + b_U$, and $L$ is "essentially banded" with lower
bandwidth $b_L$. This means that $L$ has at most $b_L + 1$ nonzeros in each column
and so can be stored in the same space as a band matrix with lower bandwidth
$b_L$.


Sketch of Proof. Again a picture of the region changed by one step of the
algorithm illustrates the proof. As illustrated in Figure 2.8, pivoting can
increase the upper bandwidth by at most $b_L$. Later permutations can reorder
the entries of earlier columns so that entries of $L$ may lie below subdiagonal $b_L$,
but no new nonzeros can be introduced, so the storage needed for $L$ remains
$b_L$ per column.

Gaussian elimination and Cholesky for band matrices are available in
LAPACK routines like sgbsv and spbsv.


Band matrices often arise from discretizing physical problems with nearest 
neighbor interactions on a mesh (provided the unknowns are ordered rowwise 
or columnwise; see also Example 2.9 and section 6.3). 


Fig. 2.8. Band LU factorization with partial pivoting.


EXAMPLE 2.8. Consider the ordinary differential equation (ODE) $y''(x) -
p(x)y'(x) - q(x)y(x) = r(x)$ on the interval $[a,b]$ with boundary conditions
$y(a) = \alpha$, $y(b) = \beta$. We also assume $q(x) \ge \underline{q} > 0$. This equation may be used
to model the heat flow in a long, thin rod, for example. To solve the differential
equation numerically, we discretize it by seeking its solution only at the evenly
spaced mesh points $x_i = a + ih$, $i = 0, \ldots, N+1$, where $h = (b-a)/(N+1)$ is
the mesh spacing. Define $p_i = p(x_i)$, $r_i = r(x_i)$, and $q_i = q(x_i)$. We need to
derive equations to solve for our desired approximations $y_i \approx y(x_i)$, where $y_0 = \alpha$
and $y_{N+1} = \beta$. To derive these equations, we approximate the derivative $y'(x_i)$
by the following finite difference approximation:

$$
y'(x_i) \approx \frac{y_{i+1} - y_{i-1}}{2h}.
$$

(Note that as $h$ gets smaller, the right-hand side approximates $y'(x_i)$ more and
more accurately.) We can similarly approximate the second derivative by

$$
y''(x_i) \approx \frac{y_{i+1} - 2y_i + y_{i-1}}{h^2}.
$$

(See section 6.3.1 in Chapter 6 for a more detailed derivation.)
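Inserting these approximations into the differential equation yields

$$
\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} - p_i \frac{y_{i+1} - y_{i-1}}{2h} - q_i y_i = r_i,
\qquad 1 \le i \le N.
$$

Rewriting this as a linear system we get $Ay = b$, where

$$
y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \qquad
b = -\frac{h^2}{2} \begin{bmatrix} r_1 \\ \vdots \\ r_N \end{bmatrix}
+ \begin{bmatrix} \left(\frac{1}{2} + \frac{h}{4} p_1\right) \alpha \\ 0 \\ \vdots \\ 0 \\ \left(\frac{1}{2} - \frac{h}{4} p_N\right) \beta \end{bmatrix},
$$

and

$$
A = \begin{bmatrix}
a_1 & -c_1 & & \\
-b_2 & \ddots & \ddots & \\
 & \ddots & \ddots & -c_{N-1} \\
 & & -b_N & a_N
\end{bmatrix}, \qquad
\begin{aligned}
a_i &= 1 + \tfrac{h^2}{2} q_i, \\
b_i &= \tfrac{1}{2}\left[1 + \tfrac{h}{2} p_i\right], \\
c_i &= \tfrac{1}{2}\left[1 - \tfrac{h}{2} p_i\right].
\end{aligned}
$$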


Note that $a_i > 0$, and also $b_i > 0$ and $c_i > 0$ if $h$ is small enough.

This is a nonsymmetric tridiagonal system to solve for y. We will show 
how to change it to a symmetric positive definite tridiagonal system, so that 
we may use band Cholesky to solve it. 


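Choose $D = \mathrm{diag}\left(1, \sqrt{\frac{c_1}{b_2}}, \sqrt{\frac{c_1 c_2}{b_2 b_3}}, \ldots, \sqrt{\frac{c_1 c_2 \cdots c_{N-1}}{b_2 b_3 \cdots b_N}}\right)$. Then we may change
$Ay = b$ to $(DAD^{-1})(Dy) = Db$ or $\tilde{A}\tilde{y} = \tilde{b}$, where

$$
\tilde{A} = \begin{bmatrix}
a_1 & -\sqrt{c_1 b_2} & & & \\
-\sqrt{c_1 b_2} & a_2 & -\sqrt{c_2 b_3} & & \\
 & -\sqrt{c_2 b_3} & \ddots & \ddots & \\
 & & \ddots & \ddots & -\sqrt{c_{N-1} b_N} \\
 & & & -\sqrt{c_{N-1} b_N} & a_N
\end{bmatrix}.
$$

It is easy to see that $\tilde{A}$ is symmetric, and it has the same eigenvalues as
$A$ because $A$ and $\tilde{A} = DAD^{-1}$ are similar. (See section 4.2 in Chapter 4 for
details.) We will use the next theorem to show it is also positive definite.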


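THEOREM 2.9. Gershgorin. Let $B$ be an arbitrary matrix. Then the eigenvalues
$\lambda$ of $B$ are located in the union of the $n$ disks

$$
|\lambda - b_{kk}| \le \sum_{j \neq k} |b_{kj}|.
$$

Proof. Given $\lambda$ and $x \neq 0$ such that $Bx = \lambda x$, let $1 = \|x\|_\infty = x_k$ by
scaling $x$ if necessary. Then $\sum_{j=1}^{N} b_{kj} x_j = \lambda x_k = \lambda$, so $\lambda - b_{kk} = \sum_{j \neq k} b_{kj} x_j$,
implying

$$
|\lambda - b_{kk}| \le \sum_{j \neq k} |b_{kj} x_j| \le \sum_{j \neq k} |b_{kj}|. \qquad \Box
$$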
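As a quick numerical illustration of Gershgorin's theorem, the sketch below computes the disks of a small matrix and checks that its eigenvalues fall in their union. The function names and the 2-by-2 test matrix are assumptions chosen for this example; the eigenvalues are obtained from the quadratic formula, which is valid here because this particular matrix has real eigenvalues.

```python
import math

def gershgorin_disks(b):
    """Return one (center, radius) pair per row of the square matrix b."""
    disks = []
    for k, row in enumerate(b):
        radius = sum(abs(v) for j, v in enumerate(row) if j != k)
        disks.append((row[k], radius))
    return disks

def eig_2x2_real(b):
    """Eigenvalues of a real 2-by-2 matrix, assuming they are real."""
    tr = b[0][0] + b[1][1]
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    disc = math.sqrt(tr * tr - 4.0 * det)
    return ((tr - disc) / 2.0, (tr + disc) / 2.0)

B = [[4.0, 1.0],
     [2.0, 7.0]]
disks = gershgorin_disks(B)          # disk centers 4 and 7, radii 1 and 2
eigs = eig_2x2_real(B)
in_union = [any(abs(lam - c) <= r for c, r in disks) for lam in eigs]
```

Each entry of `in_union` records whether the corresponding eigenvalue lies in at least one disk.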


Now if $h$ is so small that for all $i$, $|\frac{h}{2} p_i| < 1$, then

$$
|b_i| + |c_i| = \frac{1}{2}\left(1 + \frac{h}{2} p_i\right) + \frac{1}{2}\left(1 - \frac{h}{2} p_i\right)
= 1 < 1 + \frac{h^2}{2}\underline{q} \le 1 + \frac{h^2}{2} q_i = a_i.
$$

Therefore all eigenvalues of $A$ lie inside the disks centered at $1 + h^2 q_i/2 \ge
1 + h^2 \underline{q}/2$ with radius 1; in particular, they must all have positive real parts.
Since $\tilde{A}$ is symmetric, its eigenvalues are real and hence positive, so $\tilde{A}$ is positive
definite. Its smallest eigenvalue is bounded below by $\underline{q} h^2/2$. Thus, it can be
solved by Cholesky. The LAPACK subroutine for solving a symmetric positive
definite tridiagonal system is sptsv.
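The whole construction of this example (discretize, symmetrize with a diagonal scaling, then solve by tridiagonal Cholesky) can be sketched as follows. This is an illustrative sketch only: the function name `solve_bvp`, its argument order, and the test problem are assumptions, and the hand-written tridiagonal Cholesky stands in for a library routine such as sptsv.

```python
import math

def solve_bvp(p, q, r, a, b, alpha, beta, N):
    """Solve y'' - p y' - q y = r on [a, b], y(a) = alpha, y(b) = beta,
    at N interior mesh points; assumes |h p / 2| < 1 and q > 0."""
    h = (b - a) / (N + 1)
    x = [a + (i + 1) * h for i in range(N)]            # x_1 .. x_N
    diag = [1.0 + 0.5 * h * h * q(t) for t in x]       # a_i
    sub = [0.5 * (1.0 + 0.5 * h * p(t)) for t in x]    # b_i
    sup = [0.5 * (1.0 - 0.5 * h * p(t)) for t in x]    # c_i
    rhs = [-0.5 * h * h * r(t) for t in x]
    rhs[0] += sub[0] * alpha                           # boundary terms
    rhs[-1] += sup[-1] * beta
    # Diagonal scaling D: d_1 = 1, d_{i+1} = d_i * sqrt(c_i / b_{i+1}).
    d = [1.0]
    for i in range(N - 1):
        d.append(d[i] * math.sqrt(sup[i] / sub[i + 1]))
    # Symmetrized matrix: diagonal a_i, off-diagonal -sqrt(c_i b_{i+1}).
    off = [-math.sqrt(sup[i] * sub[i + 1]) for i in range(N - 1)]
    rhs_t = [d[i] * rhs[i] for i in range(N)]
    # Tridiagonal Cholesky: L is lower bidiagonal (diagonal g, subdiagonal s).
    g = [0.0] * N
    s = [0.0] * (N - 1)
    g[0] = math.sqrt(diag[0])
    for i in range(1, N):
        s[i - 1] = off[i - 1] / g[i - 1]
        g[i] = math.sqrt(diag[i] - s[i - 1] ** 2)
    # Forward substitution L z = rhs_t, then back substitution L^T yt = z.
    z = [0.0] * N
    z[0] = rhs_t[0] / g[0]
    for i in range(1, N):
        z[i] = (rhs_t[i] - s[i - 1] * z[i - 1]) / g[i]
    yt = [0.0] * N
    yt[-1] = z[-1] / g[-1]
    for i in range(N - 2, -1, -1):
        yt[i] = (z[i] - s[i] * yt[i + 1]) / g[i]
    # Undo the scaling: y = D^{-1} yt.
    return x, [yt[i] / d[i] for i in range(N)]

# Test problem with p = q = 1 and exact solution y(x) = x (1 - x) on [0, 1].
x, y = solve_bvp(lambda t: 1.0, lambda t: 1.0,
                 lambda t: -2.0 - (1.0 - 2.0 * t) - t * (1.0 - t),
                 0.0, 1.0, 0.0, 0.0, 9)
max_err = max(abs(yi - xi * (1.0 - xi)) for xi, yi in zip(x, y))
```

The test problem is chosen so that the exact solution is a quadratic, for which both central differences are exact; `max_err` then measures only roundoff.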

In section 4.3 we will again use Gershgorin's theorem to compute perturbation
bounds for eigenvalues of matrices. ◇


2.7.4. General Sparse Matrices 


A sparse matrix is defined to be a matrix with a large number of zero entries. 
In practice, this means a matrix with enough zero entries that it is worth using 
an algorithm that avoids storing or operating on the zero entries. Chapter 6 
is devoted to methods for solving sparse linear systems other than Gaussian 
elimination and its variants. There are a large number of sparse methods, and 
choosing the best one often requires substantial knowledge about the matrix 
[24]. In this section we will only sketch the basic issues in sparse Gaussian 
elimination and give pointers to the literature and available software. 

To give a very simple example, consider the following matrix, which is
ordered so that GEPP does not permute any rows:

$$
A = \begin{bmatrix}
1 & & & & .1 \\
 & 1 & & & .1 \\
 & & 1 & & .1 \\
 & & & 1 & .1 \\
.1 & .1 & .1 & .1 & 1
\end{bmatrix}
= LU =
\begin{bmatrix}
1 & & & & \\
 & 1 & & & \\
 & & 1 & & \\
 & & & 1 & \\
.1 & .1 & .1 & .1 & 1
\end{bmatrix}
\begin{bmatrix}
1 & & & & .1 \\
 & 1 & & & .1 \\
 & & 1 & & .1 \\
 & & & 1 & .1 \\
 & & & & .96
\end{bmatrix}.
$$

$A$ is called an arrow matrix because of the pattern of its nonzero entries. Note
that none of the zero entries of $A$ were filled in by GEPP so that $L$ and $U$
together can be stored in the same space as the nonzero entries of $A$. Also, if
we count the number of essential arithmetic operations, i.e., not multiplication
by zero or adding zero, there are only 12 of them (4 divisions to compute the
last row of $L$ and 8 multiplications and additions to update the (5,5) entry),
instead of $\frac{2}{3}n^3 \approx 83$. More generally, if $A$ were an $n$-by-$n$ arrow matrix, it
would take only $3n - 2$ locations to store it instead of $n^2$, and $3n - 3$ floating
point operations to perform Gaussian elimination instead of $\frac{2}{3}n^3$. When $n$ is
large, both the space and operation count become tiny compared to a dense
matrix.

Suppose that instead of $A$ we were given $A'$, which is $A$ with the order
of its rows and columns reversed. This amounts to reversing the order of the
equations and of the unknowns in the linear system $Ax = b$. GEPP applied to
$A'$ again permutes no rows, and to two decimal places we get

$$
A' = \begin{bmatrix}
1 & .1 & .1 & .1 & .1 \\
.1 & 1 & & & \\
.1 & & 1 & & \\
.1 & & & 1 & \\
.1 & & & & 1
\end{bmatrix}
= L'U' =
\begin{bmatrix}
1 & & & & \\
.1 & 1 & & & \\
.1 & -.01 & 1 & & \\
.1 & -.01 & -.01 & 1 & \\
.1 & -.01 & -.01 & -.01 & 1
\end{bmatrix}
\begin{bmatrix}
1 & .1 & .1 & .1 & .1 \\
 & .99 & -.01 & -.01 & -.01 \\
 & & .99 & -.01 & -.01 \\
 & & & .99 & -.01 \\
 & & & & .99
\end{bmatrix}.
$$

Now we see that $L'$ and $U'$ have filled in completely and require $n^2$ storage.
Indeed, after the first step of the algorithm all the nonzeros of $A'$ have filled
in, so we must do the same work as dense Gaussian elimination, $\frac{2}{3}n^3$.
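The contrast between the two orderings is easy to reproduce. The sketch below runs elimination without pivoting (which coincides with GEPP here, since no rows are permuted for these matrices) and counts the nonzeros left in the combined L and U factors. The function names `lu_nnz` and `arrow` are assumptions made for this illustration.

```python
def lu_nnz(a):
    """Eliminate without pivoting on a copy of a; return the number of nonzeros
    in L and U together (the unit diagonal of L is not counted)."""
    n = len(a)
    m = [row[:] for row in a]
    for j in range(n - 1):
        for i in range(j + 1, n):
            if m[i][j] != 0.0:
                m[i][j] /= m[j][j]              # multiplier l_ij overwrites a_ij
                for k in range(j + 1, n):
                    m[i][k] -= m[i][j] * m[j][k]
    return sum(1 for row in m for v in row if v != 0.0)

def arrow(n, eps=0.1):
    """n-by-n arrow matrix: unit diagonal, eps in the last row and column."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n - 1):
        a[i][n - 1] = eps
        a[n - 1][i] = eps
    return a

A = arrow(5)
A_rev = [row[::-1] for row in A[::-1]]   # reverse the order of rows and columns
nnz_natural = lu_nnz(A)
nnz_reversed = lu_nnz(A_rev)
```

For n = 5 the first count equals 3n - 2 and the second equals n^2, matching the storage figures discussed in the text.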

This illustrates that the order of the rows and columns is extremely im- 
portant for saving storage and work. Even if we do not have to worry about 
pivoting for numerical stability (such as in Cholesky), choosing the optimal per- 
mutations of rows and columns to minimize storage or work is an extremely 
hard problem. In fact, it is NP-complete [109], which means that all known 
algorithms for finding the optimal permutation run in time which grows expo- 
nentially with n, and so are vastly more expensive than even dense Gaussian 
elimination for large n. Thus we must settle for using heuristics, of which there 
are several successful candidates. We illustrate some of these below. 

In addition to the complication of choosing a good row and column per- 
mutation, there are other reasons sparse Gaussian elimination or Cholesky are 
much more complicated than their dense counterparts. First, we need to design 
a data structure that holds only the nonzero entries of A; there are several in 
common use [91]. Next, we need a data structure to accommodate new entries 
of L and U that fill in during elimination. This means that either the data 
structure must grow dynamically during the algorithm or we must cheaply 
precompute it without actually performing the elimination. Finally, we must 
use the data structure to perform only the minimum number of floating point 
operations and at most proportionately many integer and logical operations. 
In other words, we cannot afford to do $O(n^2)$ integer and logical operations to
discover the few floating point operations that we want to do. A more complete 
discussion of these algorithms is beyond the scope of this book [112, 91], but 
we will indicate available software. 
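As one concrete instance of a data structure that holds only the nonzero entries, here is a sketch of compressed sparse row (CSR) storage together with a matrix-vector product that touches only stored entries. CSR is one of several formats in common use; the function names and the small test matrix are assumptions for this example, and the sketch does not address the harder problem of accommodating fill-in during elimination.

```python
def to_csr(a):
    """Convert a dense matrix (list of rows) to CSR arrays.

    values[k] is a stored nonzero, col_index[k] its column, and row i occupies
    positions row_start[i] .. row_start[i+1]-1 of those two arrays.
    """
    values, col_index, row_start = [], [], [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_index.append(j)
        row_start.append(len(values))
    return values, col_index, row_start

def csr_matvec(csr, x):
    """Compute y = A x using one multiply-add per stored nonzero."""
    values, col_index, row_start = csr
    y = []
    for i in range(len(row_start) - 1):
        acc = 0.0
        for k in range(row_start[i], row_start[i + 1]):
            acc += values[k] * x[col_index[k]]
        y.append(acc)
    return y

A = [[1.0, 0.0, 0.1],
     [0.0, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
csr = to_csr(A)
y = csr_matvec(csr, [1.0, 1.0, 1.0])
```

The integer work in `csr_matvec` is proportional to the number of stored nonzeros plus the number of rows, which is the kind of proportionality the paragraph above asks for.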


EXAMPLE 2.9. We illustrate sparse Cholesky on a more realistic example that
arises from modeling the displacement of a mechanical structure subject to
external forces. Figure 2.9 shows a simple mesh of a two-dimensional slice of
a mechanical structure with two internal cavities. The mathematical problem
is to compute the displacements of all the grid points of the mesh (which
are internal to the structure) subject to some forces applied to the boundary
of the structure. The mesh points are numbered from 1 to $n = 483$; more
realistic problems would have much larger values of $n$. The equations relating
displacements to forces lead to a system of linear equations $Ax = b$, with
one row and column for each of the 483 mesh points and with $a_{ij} \neq 0$ if and
only if mesh point $i$ is connected by a line segment to mesh point $j$. This
means that $A$ is a symmetric matrix; it also turns out to be positive definite,
so that we can use Cholesky to solve $Ax = b$. Note that $A$ has only $nz = 3971$
nonzeros of a possible $483^2 = 233289$, so $A$ is just $3971/233289 = 1.7\%$ filled.
(See Examples 4.1 and 5.1 for similar mechanical modeling problems, where
the matrix $A$ is derived in detail.)

Fig. 2.9. Mesh for a mechanical structure. (Plot title: Mechanical Structure with Mesh.)


Figure 2.10 shows the same mesh (above) along with the nonzero pattern 
of the matrix A (below), where the 483 nodes are ordered in the “natural” 
way, with the logically rectangular substructures numbered rowwise, one sub- 
structure after the other. The edges in each such substructure have a common 
color, and these colors match the colors of the nonzeros in the matrix. Each 
substructure has a label “(i : j)” to indicate that it corresponds to rows and 
columns i through j of A. The corresponding submatrix A(i : j,i : j) is a 
narrow band matrix. (Example 2.8 and section 6.3 describe other situations 
in which a mesh leads to a band matrix.) The edges connecting different sub- 
structures are red and correspond to the red entries of A, which are farthest 
from the diagonal of A. 
Fig. 2.10. The edges in the mesh at the top are colored and numbered to match the
sparse matrix A at the bottom. For example, the first 49 nodes of the mesh (the leftmost
green nodes) correspond to rows and columns 1 through 49 of A. (Plot title: Matrix A,
in natural order.)

The top pair of plots in Figure 2.11 again shows the sparsity structure of
$A$ in the natural order, along with the sparsity structure of its Cholesky factor
$L$. Nonzero entries of $L$ corresponding to nonzero entries of $A$ are black; new
nonzeros of $L$, called fill-in, are red. $L$ has 11533 nonzero entries, over five
times as many as the lower triangle of $A$. Computing $L$ by Cholesky costs only
296923 flops, just 0.8% of the $\frac{1}{3}n^3 = 3.76 \cdot 10^7$ flops that dense Cholesky would
have required.

The number of nonzeros in L and the number of flops required to compute 
L can be changed significantly by reordering the rows and columns of A. The 
middle pair of plots in Figure 2.11 show the results of one such popular re- 
ordering, called reverse Cuthill-McKee [112, 91], which is designed to make A 
a narrow band matrix. As can be seen, it is quite successful at this, reducing 
the fill-in of L 21% (from 11533 to 9073) and reducing the flop count almost 
39% (from 296923 to 181525). 

Another popular ordering algorithm is called minimum degree ordering 
[112, 91] which is designed to create as little fill-in at each step of Cholesky as 
possible. The results are shown in the bottom pair of plots in Figure 2.11: the 
fill-in of L is reduced a further 7% (from 9073 to 8440) but the flop count is 
increased 9% (from 181525 to 198236). ◇
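The effect of ordering on fill-in can be explored symbolically, without any floating point work, by playing the "elimination game" on the graph of the matrix: eliminating a vertex connects all of its remaining neighbours, and each new edge is one fill entry of L. The sketch below counts fill for a given order and builds a greedy minimum degree order. It is a toy version under stated assumptions: the function names, the tie-breaking rule (smallest label), and the star-graph example are choices made here, and production minimum degree codes use far more efficient data structures.

```python
def fill_in(adj, order):
    """Count fill edges created by eliminating vertices in the given order.

    adj maps each vertex to the set of its neighbours (the symmetric nonzero
    pattern of A, diagonal excluded).
    """
    g = {v: set(nb) for v, nb in adj.items()}
    fill = 0
    for v in order:
        nbrs = sorted(g[v])
        # Eliminating v pairwise connects its remaining neighbours.
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                u, w = nbrs[i], nbrs[j]
                if w not in g[u]:
                    g[u].add(w)
                    g[w].add(u)
                    fill += 1
        for u in nbrs:
            g[u].discard(v)
        del g[v]
    return fill

def minimum_degree_order(adj):
    """Greedy order: always eliminate a vertex of smallest current degree."""
    g = {v: set(nb) for v, nb in adj.items()}
    order = []
    while g:
        v = min(g, key=lambda u: (len(g[u]), u))
        nbrs = sorted(g[v])
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                g[nbrs[i]].add(nbrs[j])
                g[nbrs[j]].add(nbrs[i])
        for u in nbrs:
            g[u].discard(v)
        del g[v]
        order.append(v)
    return order

# Star graph: vertex 0 is coupled to vertices 1..4 (an arrow matrix pattern
# with the dense row and column ordered first).
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
natural = [0, 1, 2, 3, 4]
md = minimum_degree_order(star)
fill_natural = fill_in(star, natural)
fill_md = fill_in(star, md)
```

On this small pattern the natural order fills in every remaining pair, while the greedy order defers the hub vertex and creates no fill at all.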


Many sparse matrix examples are available as built-in demos in Matlab, 
which also has many sparse matrix operations built into it (type “help sparfun” 
in Matlab for a list). To see the examples, type demo in Matlab, then click 
on “continue,” then on “Matlab/Visit,” and then on either “Matrices/Select 
a demo/Sparse” or “Matrices/Select a demo/Cmd line demos.” For example, 
Figure 2.12 shows a Matlab example of a mesh around a wing, where the goal is 
to compute the airflow around the wing at the mesh points. The corresponding 
partial differential equations of airflow lead to a nonsymmetric linear system 
whose sparsity pattern is also shown. 


Sparse Matrix Software 


Besides Matlab, there is a variety of public domain and commercial sparse 
matrix software available in Fortran or C. Since this is still an active research 
area (especially with regard to high-performance machines), it is impossible 
to recommend a single best algorithm. Table 2.2 [175] gives a list of available 
software, categorized in several ways. We restrict ourselves to supported codes 
(either public or commercial) or else research codes when no other software is 
available for that type of problem or machine. We refer to [175, 92] for more 
complete lists and explanations of the algorithms below. 

Table 2.2 is organized as follows. The top group of routines, labeled serial
algorithms, are designed for single-processor workstations and PCs. The
shared-memory algorithms are for symmetric multiprocessors, such as the SUN
SPARCcenter 2000 [236], SGI Power Challenge [221], DEC AlphaServer 8400 [101],
and Cray C90/J90 [251, 252]. The distributed-memory algorithms are for
machines such as the IBM SP-2 [254], Intel Paragon [255], Cray T3 series [253],
and networks of workstations [9]. As you can see, most software has been
written for serial machines, some for shared-memory machines, and very little
(besides research software) for distributed memory.

Fig. 2.11. Sparsity and flop counts for A with various orderings.

Fig. 2.12. Mesh around the NASA airfoil. (Plot title: Finite Element Mesh of NASA Airfoil.)

The first column gives the matrix type. The possibilities include nonsymmetric,
symmetric pattern (i.e., either $a_{ij} = a_{ji} = 0$, or both can be nonzero
and unequal), symmetric (and possibly indefinite), and symmetric positive
definite (s.p.d.). The second column gives the name of the routine or of the
authors.

The third column gives some detail on the algorithm, indeed more than
we have explained in detail in the text: LL (left looking), RL (right looking),
frontal, MF (multifrontal), and $LDL^T$ refer to different ways to organize
the three nested loops defining Gaussian elimination. Partial, Markowitz, and
threshold refer to different pivoting strategies. 2D-blocking refers to which
parallel processors are responsible for which parts of the matrix. CAPSS
assumes that the linear system is defined by a grid and requires the x, y, and z
coordinates of the grid points in order to distribute the matrix among the
processors.

The third column also describes the organization of the innermost loop,
which could be BLAS1, BLAS2, BLAS3, or scalar. SD refers to the algorithm
switching to dense Gaussian elimination after step $k$ when the trailing
$(n-k)$-by-$(n-k)$ submatrix is dense enough.

The fourth column describes the status and availability of the software,
including whether it is public or commercial and how to get it.


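2.7.5. Dense Matrices Depending on Fewer Than $O(n^2)$ Parameters


This is a catch-all heading, which includes a large variety of matrices that arise
in practice. We mention just a few cases.
Vandermonde matrices are of the form

$$
V = \begin{bmatrix}
1 & 1 & \cdots & 1 \\
x_0 & x_1 & \cdots & x_{n-1} \\
x_0^2 & x_1^2 & \cdots & x_{n-1}^2 \\
\vdots & \vdots & & \vdots \\
x_0^{n-1} & x_1^{n-1} & \cdots & x_{n-1}^{n-1}
\end{bmatrix}.
$$

Note that the matrix-vector multiplication

$$
V^T \cdot [a_0, \ldots, a_{n-1}]^T = \left[ \sum_i a_i x_0^i, \ldots, \sum_i a_i x_{n-1}^i \right]^T
$$

is equivalent to polynomial evaluation; therefore, solving $V^T a = y$ is polynomial
interpolation. Using Newton interpolation we can solve $V^T a = y$ in $\frac{5}{2}n^2$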
Matrix type  | Name                        | Algorithm                            | Status/source
-------------+-----------------------------+--------------------------------------+--------------
Serial algorithms
nonsym.      | SuperLU                     | LL, partial, BLAS-2.5                | Pub/UCB
nonsym.      | UMFPACK [61, 62]            | MF, Markowitz, BLAS-3                | Pub/NETLIB
             | MA38 (same as UMFPACK)      |                                      | Com/HSL
nonsym.      | MA48 [94]                   | Anal: RL, Markowitz;                 | Com/HSL
             |                             | Fact: LL, partial, BLAS-1, SD        |
nonsym.      | SPARSE [165]                | RL, Markowitz, scalar                | Pub/NETLIB
sym-pattern  | MUPS [5]                    | MF, threshold, BLAS-3                | Com/HSL
sym-pattern  | MA42 [96]                   | Frontal, BLAS-3                      | Com/HSL
sym.         | MA27 [95]/MA47 [93]         | MF, LDL^T, BLAS-1                    | Com/HSL
s.p.d.       | Ng & Peyton [189]           | LL, BLAS-3                           | Pub/Author
Shared-memory algorithms
nonsym.      | SuperLU                     | LL, partial, BLAS-2.5                | Pub/UCB
nonsym.      | PARASPAR [268, 269]         | RL, Markowitz, BLAS-1, SD            | Res/Author
sym-pattern  | MUPS [6]                    | MF, threshold, BLAS-3                | Res/Author
nonsym.      | George & Ng [113]           | RL, partial, BLAS-1                  | Res/Author
s.p.d.       | Gupta et al. [131]          | LL, BLAS-3                           | Com/SGI, Pub/Author
s.p.d.       | SPLASH [153]                | RL, 2-D block, BLAS-3                | Pub/Stanford
Distributed-memory algorithms
sym.         | van der Stappen [243]       | RL, Markowitz, scalar                | Res/Author
sym-pattern  | Lucas et al. [178]          | MF, no pivoting, BLAS-1              | Res/Author
s.p.d.       | Rothberg & Schreiber [205]  | RL, 2-D block, BLAS-3                | Res/Author
s.p.d.       | Gupta & Kumar [130]         | MF, 2-D block, BLAS-3                | Res/Author
s.p.d.       | CAPSS [141]                 | MF, full parallel, BLAS-1            | Pub/NETLIB
             |                             | (require coordinates)                |


Table 2.2. Software to solve sparse linear systems using direct methods.
Abbreviations used in the table:
nonsym. = nonsymmetric.
sym-pattern = symmetric nonzero structure, nonsymmetric values.
sym. = symmetric and may be indefinite.
s.p.d. = symmetric and positive definite.
MF, LL, and RL = multifrontal, left-looking, and right-looking.
SD = switches to a dense code on a sufficiently dense trailing submatrix.
Pub = publicly available; authors may help use the code.
Res = published in literature but may not be available from the authors.
Com = commercial.
HSL = Harwell Subroutine Library:
http://www.rl.ac.uk/departments/ccd/numerical/hsl/hsl.html.
UCB = http://www.cs.berkeley.edu/~xiaoye/superlu.html.
Stanford = http://www-flash.stanford.edu/apps/SPLASH/.
instead of $\frac{2}{3}n^3$ flops. There is a similar trick to solve $Va = y$ in $\frac{5}{2}n^2$ flops too.
See [119, p. 178].
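To illustrate how Newton interpolation solves the transposed Vandermonde system in a quadratic number of operations, here is a sketch in two stages: divided differences produce the Newton-form coefficients, and a second sweep expands them into monomial coefficients. The function name and the sample data are assumptions for this example; the nodes are assumed distinct, and no claim is made about matching the exact operation count quoted above.

```python
def solve_vandermonde_transpose(x, y):
    """Return a with sum_k a[k] * x[i]**k == y[i] for all i (distinct nodes x)."""
    n = len(x)
    c = list(y)
    # Stage 1: divided differences, giving the Newton form
    #   p(t) = c[0] + c[1](t - x[0]) + c[2](t - x[0])(t - x[1]) + ...
    for k in range(1, n):
        for i in range(n - 1, k - 1, -1):
            c[i] = (c[i] - c[i - 1]) / (x[i] - x[i - k])
    # Stage 2: expand the nested Newton form into monomial coefficients.
    for k in range(n - 2, -1, -1):
        for i in range(k, n - 1):
            c[i] -= c[i + 1] * x[k]
    return c

# Data sampled from p(t) = 1 + 2 t + 3 t^2 at t = 0, 1, 2.
coeffs = solve_vandermonde_transpose([0.0, 1.0, 2.0], [1.0, 6.0, 17.0])
```

Both stages are doubly nested loops over at most n indices, so the total work grows like n^2 rather than the n^3 of Gaussian elimination.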
Cauchy matrices $C$ have entries

$$
c_{ij} = \frac{\alpha_i \beta_j}{\xi_i - \eta_j},
$$

where $\alpha = [\alpha_1, \ldots, \alpha_n]$, $\beta = [\beta_1, \ldots, \beta_n]$, $\xi = [\xi_1, \ldots, \xi_n]$, and $\eta = [\eta_1, \ldots, \eta_n]$
are given vectors. The best-known example is the notoriously ill-conditioned
Hilbert matrix $H$, with $h_{ij} = 1/(i+j-1)$. These matrices arise in interpolating
data by rational functions: Suppose that we want to find the coefficients $x_j$ of
the rational function with fixed poles $\eta_j$

$$
f(\xi) = \sum_{j=1}^{n} \frac{x_j}{\xi - \eta_j}
$$

such that $f(\xi_i) = y_i$ for $i = 1$ to $n$. Taken together these $n$ equations $f(\xi_i) = y_i$
form an $n$-by-$n$ linear system with a coefficient matrix that is Cauchy. The
inverse of a Cauchy matrix turns out to be a Cauchy matrix, and there is a
closed form expression for $C^{-1}$, based on its connection with interpolation:

$$
(C^{-1})_{ij} = \beta_i^{-1} \alpha_j^{-1} (\xi_j - \eta_i) P_j(\eta_i) Q_i(\xi_j),
$$

where $P_j(\cdot)$ and $Q_i(\cdot)$ are the Lagrange interpolation polynomials

$$
P_j(x) = \prod_{k \neq j} \frac{x - \xi_k}{\xi_j - \xi_k}
\qquad \text{and} \qquad
Q_i(x) = \prod_{k \neq i} \frac{x - \eta_k}{\eta_i - \eta_k}.
$$

Toeplitz matrices have entries $t_{ij} = a_{j-i}$;
i.e., they are constant along diagonals. They arise in problems of signal
processing. There are algorithms for solving such systems that take only $O(n^2)$
operations.
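The closed form for the inverse of a Cauchy matrix can be checked in exact rational arithmetic. The sketch below assumes the unscaled case with entries 1/(xi_i - eta_j) (all scaling factors equal to 1) and uses Lagrange polynomials on the two node sets; the function names are hypothetical, and the test case is the 3-by-3 Hilbert matrix, obtained with xi = (1, 2, 3) and eta = (0, -1, -2).

```python
from fractions import Fraction

def cauchy(xi, eta):
    """Cauchy matrix with entries 1 / (xi[i] - eta[j]), as exact fractions."""
    n = len(xi)
    return [[Fraction(1, xi[i] - eta[j]) for j in range(n)] for i in range(n)]

def lagrange(nodes, k, t):
    """Value at t of the Lagrange polynomial that is 1 at nodes[k], 0 at the others."""
    prod = Fraction(1)
    for i, node in enumerate(nodes):
        if i != k:
            prod *= Fraction(t - node, nodes[k] - node)
    return prod

def cauchy_inverse(xi, eta):
    """Closed-form inverse: entry (i, j) is (xi_j - eta_i) P_j(eta_i) Q_i(xi_j)."""
    n = len(xi)
    return [[(xi[j] - eta[i]) * lagrange(xi, j, eta[i]) * lagrange(eta, i, xi[j])
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

xi, eta = [1, 2, 3], [0, -1, -2]
H = cauchy(xi, eta)                  # 3-by-3 Hilbert matrix
H_inv = cauchy_inverse(xi, eta)
product = matmul(H, H_inv)
```

Because the arithmetic is exact, `product` can be compared with the identity matrix without any tolerance, which sidesteps the ill-conditioning of the Hilbert matrix.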

All these methods generalize to many other similar matrices depending on 
only O(n) parameters. See [119, p. 183] or [158] for a recent survey. 


2.8. References and Other Topics for Chapter 2 


Further details about linear equation solving in general may be found in chap- 
ters 3 and 4 of [119]. The reciprocal relationship between condition numbers 
and distance to the nearest ill-posed problem is further explored in [70]. An
average case analysis of pivot growth is described in [240], and an example of 
bad pivot growth with complete pivoting is given in [120]. Condition estima- 
tors are described in [136, 144, 146]. Single precision iterative refinement is 
analyzed in [14, 223, 224]. A comprehensive discussion of error analysis for 
linear equation solvers, which covers most of these topics, can be found in 
[147]. 

For symmetric indefinite factorization, see [43]. Sparse matrix algorithms 
are described in [112, 91] as well as the numerous references in Table 2.2. 
Implementations of many of the algorithms for dense and band matrices de- 
scribed in this chapter are available in LAPACK and CLAPACK [10], which 
includes a discussion of block algorithms suitable for high-performance com- 
puters. The BLAS are described in [85, 87, 167]. These and other routines are 
available electronically in NETLIB. An analysis of blocking strategies for ma- 
trix multiplication is given in [149]. Strassen’s matrix multiplication algorithm 
is presented in [3], its performance in practice is described in [22], and its nu- 
merical stability is described in [76, 147]. A survey of parallel and other block 
algorithms is given in [75]. For a recent survey of algorithms for structured 
dense matrices depending only on O(n) parameters, see [158]. 


2.9. Questions for Chapter 2 


QUESTION 2.1. (Easy) Using your favorite World Wide Web browser, go to 
NETLIB (http://www.netlib.org), and answer the following questions. 


1. You need a Fortran subroutine to compute the eigenvalues and eigenvec- 
tors of real symmetric matrices in double precision. Find one using the 
Attribute/Value database search on the NETLIB repository. Report the
name and URL of the subroutine as well as how you found it. 


2. Using the Performance Database Server, find out the current world speed 
record for solving 100-by-100 dense linear systems using Gaussian elimi- 
nation. What is the speed in Mflops, and which machine attained it? Do 
the same for 1000-by-1000 dense linear systems and “big as you want” 
dense linear systems. Using the same database, find out how fast your 
workstation can solve 100-by-100 dense linear systems. Hint: Look at 
the LINPACK benchmark. 


QUESTION 2.2. (Easy) Consider solving $AX = B$ for $X$, where $A$ is $n$-by-$n$,
and $X$ and $B$ are $n$-by-$m$. There are two obvious algorithms. The first
algorithm factorizes $A = PLU$ using Gaussian elimination and then solves for
each column of $X$ by forward and back substitution. The second algorithm
computes $A^{-1}$ using Gaussian elimination and then multiplies $X = A^{-1}B$.
Count the number of flops required by each algorithm, and show that the first
one requires fewer flops.


QUESTION 2.3. (Medium) Let $\|\cdot\|$ be the two-norm. Given a nonsingular 
matrix A and a vector b, show that for sufficiently small $\|\delta A\|$, there are nonzero 
$\delta A$ and $\delta b$ such that inequality (2.2) is an equality. This justifies calling 
$\kappa(A) = \|A^{-1}\| \cdot \|A\|$ the condition number of A. Hint: Use the ideas in the proof of 
Theorem 2.1. 


QUESTION 2.4. (Hard) Show that bounds (2.7) and (2.8) are attainable. 


QUESTION 2.5. (Medium) Prove Theorem 2.3. Given the residual $r = A\hat{x} - b$, 
use Theorem 2.3 to show that bound (2.9) is no larger than bound (2.7). This 
explains why LAPACK computes a bound based on (2.9), as described in 
section 2.4.4. 


QUESTION 2.6. (Easy) Prove Lemma 2.2. 


QUESTION 2.7. (Easy; Z. Bai) If A is a nonsingular symmetric matrix and 
has the factorization $A = LDM^T$, where L and M are unit lower triangular 
matrices and D is a diagonal matrix, show that L = M. 


QUESTION 2.8. (Hard) Consider the following two ways of solving a 2-by-2 
linear system of equations: 


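$$Ax = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}.$$
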
Algorithm 1. Gaussian elimination with partial pivoting (GEPP). 


Algorithm 2. Cramer’s rule: 


det = a11 * a22 - a12 * a21, 
x1 = (a22 * b1 - a12 * b2)/det, 
x2 = (-a21 * b1 + a11 * b2)/det. 


Show by means of a numerical example that Cramer’s rule is not backward 
stable. Hint: Choose the matrix nearly singular and $[b_1\ b_2]^T \approx [a_{12}\ a_{22}]^T$. 
What does backward stability imply about the size of the residual? Your 
numerical example can be done by hand on paper (for example, with four- 
decimal-digit floating point), on a computer, or a hand calculator. 
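
As a starting point for experiments on a computer, the three formulas of Algorithm 2 can be transcribed directly. The sketch below is only an illustration: the function names `cramer_2x2` and `residual_2x2` are invented here, and the well-conditioned test system is chosen purely to check the transcription, not to answer the question.

```python
def cramer_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a 2-by-2 system with the three formulas of Cramer's rule."""
    det = a11 * a22 - a12 * a21
    x1 = (a22 * b1 - a12 * b2) / det
    x2 = (-a21 * b1 + a11 * b2) / det
    return x1, x2


def residual_2x2(a11, a12, a21, a22, b1, b2, x1, x2):
    """Return the two components of the residual A*x - b."""
    return (a11 * x1 + a12 * x2 - b1, a21 * x1 + a22 * x2 - b2)


# Well-conditioned sanity check: 2*x1 + x2 = 3, x1 + 3*x2 = 5.
x1, x2 = cramer_2x2(2.0, 1.0, 1.0, 3.0, 3.0, 5.0)
r1, r2 = residual_2x2(2.0, 1.0, 1.0, 3.0, 3.0, 5.0, x1, x2)
```

To explore the question itself, one would call the same two functions on a nearly singular matrix and compare the size of the residual with that obtained from GEPP.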


QUESTION 2.9. (Medium) Let B be an n-by-n upper bidiagonal matrix, i.e., 
nonzero only on the main diagonal and first superdiagonal. Derive an algorithm 
for computing $\kappa_\infty(B) = \|B\|_\infty \|B^{-1}\|_\infty$ exactly (ignoring roundoff). In other 
words, you should not use an iterative algorithm such as Hager’s estimator. 
Your algorithm should be as cheap as possible; it should be possible to do using 
no more than $2n - 2$ additions, $n$ multiplications, $n$ divisions, $4n - 2$ absolute 
values, and $2n - 2$ comparisons. (Anything close to this is acceptable.) 


QUESTION 2.10. (Easy; Z. Bai) Let A be n-by-m with $n \ge m$. Show that 
$\|A^TA\|_2 = \|A\|_2^2$ and $\kappa_2(A^TA) = \kappa_2(A)^2$. 

Let M be n-by-n and positive definite and L be its Cholesky factor so that 
$M = LL^T$. Show that $\|M\|_2 = \|L\|_2^2$ and $\kappa_2(M) = \kappa_2(L)^2$. 


QUESTION 2.11. (Easy; Z. Bai) Let A be symmetric and positive definite. 
Show that $|a_{ij}| < (a_{ii}a_{jj})^{1/2}$. 


QUESTION 2.12. (Easy; Z. Bai) Show that if 


$$Y = \begin{bmatrix} I & Z \\ 0 & I \end{bmatrix},$$

where I is an n-by-n identity matrix, then $\kappa_F(Y) = \|Y\|_F \|Y^{-1}\|_F = 2n + \|Z\|_F^2$. 


QUESTION 2.13. (Medium; From the 1995 Final Examination) In this ques- 
tion we will ask how to solve By = c given a fast way to solve Ax = b, where 
A — B is “small” in some sense. 


1. Prove the Sherman-Morrison formula: Let A be nonsingular, u and v 
be column vectors, and $A + uv^T$ be nonsingular. Then 
$(A + uv^T)^{-1} = A^{-1} - (A^{-1}uv^TA^{-1})/(1 + v^TA^{-1}u)$. 
More generally, prove the Sherman-Morrison-Woodbury formula: Let 
U and V be n-by-k rectangular matrices, where $k \le n$ and A is n-by-n. 
Then $T = I + V^TA^{-1}U$ is nonsingular if and only if $A + UV^T$ is 
nonsingular, in which case $(A + UV^T)^{-1} = A^{-1} - A^{-1}UT^{-1}V^TA^{-1}$. 


2. If you have a fast algorithm to solve Ax = b, show how to build a fast 
solver for By = c, where $B = A + uv^T$. 


3. Suppose that || A— B|| is “small” and you have a fast algorithm for solving 
Ax = b. Describe an iterative scheme for solving By = c. How fast do 
you expect your algorithm to converge? Hint: Use iterative refinement. 
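
A quick numerical check can make the Sherman-Morrison formula in part 1 concrete before attempting the proof. The sketch below uses exact rational arithmetic on a 2-by-2 example; the matrix, the vectors, and the helper names `inv2` and `matvec` are all illustrative assumptions, not part of the question.

```python
from fractions import Fraction as F


def inv2(M):
    """Exact inverse of a 2-by-2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(M, x):
    """Product of a 2-by-2 matrix with a vector of length 2."""
    return [sum(M[i][j] * x[j] for j in range(2)) for i in range(2)]


A = [[F(2), F(1)], [F(1), F(3)]]
u = [F(1), F(2)]
v = [F(3), F(1)]

Ainv = inv2(A)
Ainv_u = matvec(Ainv, u)                                  # A^{-1} u
vT_Ainv = [sum(v[i] * Ainv[i][j] for i in range(2))      # v^T A^{-1}
           for j in range(2)]
denom = 1 + sum(v[i] * Ainv_u[i] for i in range(2))       # 1 + v^T A^{-1} u

# Right-hand side of the formula: A^{-1} - (A^{-1} u v^T A^{-1}) / denom
rhs = [[Ainv[i][j] - Ainv_u[i] * vT_Ainv[j] / denom for j in range(2)]
       for i in range(2)]
# Left-hand side: invert A + u v^T directly
lhs = inv2([[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)])
```

Because `Fraction` arithmetic is exact, `lhs` and `rhs` agree entry by entry here. Note that the right-hand side needs only products with $A^{-1}$, which is the observation part 2 builds on.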


QUESTION 2.14. (Medium; Programming) Use Netlib to obtain a subroutine 
to solve Ax = b using Gaussian elimination with partial pivoting. You should 
get it from either LAPACK (in Fortran, NETLIB/lapack) or CLAPACK (in 
C, NETLIB/clapack); sgesvx is the main routine in both cases. (There is also 
a simpler routine sgesv that you might want to look at.) Modify sgesvx (and 
possibly other subroutines that it calls) to perform complete pivoting instead 
of partial pivoting; call this new routine gecp. It is probably simplest to 
modify sgetf2 and use it in place of sgetrf. Test sgesvx and gecp on a 
number of randomly generated matrices of various sizes up to 30 or so. By 
choosing x and forming $b = Ax$, you can use examples for which you know the 
right answer. Check the accuracy of the computed answer $\hat{x}$ as follows. First, 
examine the error bounds FERR (“Forward ERRor”) and BERR (“Backward 
ERRor”) returned by the software; in your own words, say what these bounds 
mean. Using your knowledge of the exact answer, verify that FERR is correct. 
Second, compute the exact condition number by inverting the matrix explicitly, 
and compare this to the estimate RCOND returned by the software. (Actually, 
RCOND is an estimate of the reciprocal of the condition number.) Third, confirm 
that $\|\hat{x} - x\| / \|x\|$ is bounded by a modest multiple of macheps/RCOND. Fourth, you 
should verify that the (scaled) backward error 
$R \equiv \|A\hat{x} - b\| / ((\|A\| \cdot \|\hat{x}\| + \|b\|) \cdot \text{macheps})$ is of order unity in each case. 

More specifically, your solution should consist of a well-documented pro- 
gram listing of gecp, an explanation of which random matrices you generated 
(see below), and a table with the following columns (or preferably graphs of 
each column of data, plotted against the first column): 


• test matrix number (to identify it in your explanation of how it 
  was generated); 
• its dimension; 
• from sgesvx: 
  - the pivot growth factor returned by the code 
    (this should ideally not be much larger than 1), 
  - its estimated condition number (1/RCOND), 
  - the ratio of 1/RCOND to your explicitly computed condition 
    number (this should ideally be close to 1), 
  - the error bound FERR, 
  - the ratio of FERR to the true error 
    (this should ideally be at least 1 but not much larger 
    unless you are “lucky” and the true error is zero), 
  - the ratio of the true error to $\varepsilon$/RCOND 
    (this should ideally be at most 1 or a little less, 
    unless you are “lucky” and the true error is zero), 
  - the scaled backward error $R/\varepsilon$ 
    (this should ideally be O(1) or perhaps O(n)), 
  - the backward error BERR/$\varepsilon$ 
    (this should ideally be O(1) or perhaps O(n)), 
  - the run time in seconds, 
• the same data for gecp as for sgesvx. 


You need to print the data to only one decimal place, since we care only about 
approximate magnitudes. Do the error bounds really bound the errors? How 
do the speeds of sgesvx and gecp compare? 

It is difficult to obtain accurate timings on many systems, since many timers 
have low resolution, so you should compute the run time as follows: 


    t1 = time-so-far 
    for i = 1 to m 
        set up problem 
        solve the problem 
    endfor 
    t2 = time-so-far 
    for i = 1 to m 
        set up problem 
    endfor 
    t3 = time-so-far 
    t = ((t2 - t1) - (t3 - t2))/m 


m should be chosen large enough so that t2 - t1 is at least a few seconds. Then 
t should be a reliable estimate of the time to solve the problem. 
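
The timing recipe above can be sketched in Python as follows. This is only an illustration of the recipe: the function name `time_solve` and the toy "problem" (sorting a list) are assumptions made here, and `time.perf_counter` stands in for "time-so-far".

```python
import time


def time_solve(set_up, solve, m):
    """Estimate the time of one solve, excluding the set-up cost."""
    t1 = time.perf_counter()
    for _ in range(m):
        problem = set_up()
        solve(problem)
    t2 = time.perf_counter()
    for _ in range(m):
        problem = set_up()          # same set-up work, no solve
    t3 = time.perf_counter()
    return ((t2 - t1) - (t3 - t2)) / m


def set_up():
    # Toy stand-in for "set up problem": a list in descending order.
    return list(range(2000, 0, -1))


def solve(problem):
    # Toy stand-in for "solve the problem".
    problem.sort()


t = time_solve(set_up, solve, m=200)
```

For real measurements m would be increased until the first loop runs for at least a few seconds, as the text recommends; with a small m the estimate can be noisy and may even come out negative.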

You should test some well-conditioned problems as well as some that are 
ill-conditioned. To generate a well-conditioned matrix, let P be a permuta- 
tion matrix, and add a small random number to each entry. To generate an 
ill-conditioned matrix, let L be a random lower triangular matrix with tiny 
diagonal entries and moderate subdiagonal entries. Let U be a similar upper 
triangular matrix, and let A = LU. (There is also an LAPACK subroutine 
slatms for generating random matrices with a given condition number, which 
you may use if you like.) 

Also try both solvers on the following class of n-by-n matrices for n = 1 
up to 30. (If you run in double precision, you may need to run up to n = 60.) 
Shown here is just the case n = 5; the others are similar: 


$$\begin{bmatrix} 1 & 0 & 0 & 0 & 1 \\ -1 & 1 & 0 & 0 & 1 \\ -1 & -1 & 1 & 0 & 1 \\ -1 & -1 & -1 & 1 & 1 \\ -1 & -1 & -1 & -1 & 1 \end{bmatrix}.$$


Explain the accuracy of the results in terms of the error analysis in section 2.4. 

Your solution should not contain any tables of matrix entries or solution 
components. 

In addition to teaching about error bounds, one purpose of this question is 
to show you what well-engineered numerical software looks like. In practice, 
one will often use or modify existing software instead of writing one’s own from 
scratch. 


QUESTION 2.15. (Medium; Programming) This problem depends on Ques- 
tion 2.14. Write another version of sgesvx called sgesvxdouble that com- 
putes the residual in double precision during iterative refinement. Modify the 
error bound FERR in sgesvx to reflect this improved accuracy. Explain your 
modification. (This may require you to explain how sgesvx computes its error 
bound in the first place.) On the same set of examples as in the last question, 
produce a similar table of data. When is sgesvxdouble more accurate than 
sgesvx? 


QUESTION 2.16. (Hard) Show how to reorganize the Cholesky algorithm 
(Algorithm 2.11) to do most of its operations using Level 3 BLAS. Mimic 
Algorithm 2.10. 


QUESTION 2.17. (Easy) Suppose that, in Matlab, you have an n-by-n matrix 
A and an n-by-1 matrix b. What do A\b, b'/A, and A/b mean in Matlab? How 
does A\b differ from inv(A) * b? 


QUESTION 2.18. (Medium) Let 


$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},$$

where $A_{11}$ is k-by-k and nonsingular. Then $S = A_{22} - A_{21}A_{11}^{-1}A_{12}$ is called the 
Schur complement of $A_{11}$ in A, or just Schur complement for short. 


1. Show that after k steps of Gaussian elimination without pivoting, $A_{22}$ 
has been overwritten by S. 


2. Suppose $A = A^T$, $A_{11}$ is positive definite and $A_{22}$ is negative definite 
($-A_{22}$ is positive definite). Show that A is nonsingular, that Gaussian 
elimination without pivoting will work in exact arithmetic, but (by means 
of a 2-by-2 example) that Gaussian elimination without pivoting may be 
numerically unstable. 
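
Part 1 can be checked on a small example before proving it. The following sketch, in exact rational arithmetic, compares the trailing block after k elimination steps with the Schur complement computed from its definition. The 3-by-3 matrix, the choice k = 2, and the helper names `eliminate` and `inv2` are illustrative assumptions; this variant of elimination stores zeros below the diagonal rather than the multipliers.

```python
from fractions import Fraction as F


def eliminate(A, k):
    """Run k steps of Gaussian elimination without pivoting on a copy of A."""
    n = len(A)
    A = [row[:] for row in A]
    for p in range(k):
        for i in range(p + 1, n):
            mult = A[i][p] / A[p][p]
            for j in range(p, n):
                A[i][j] -= mult * A[p][j]
    return A


def inv2(M):
    """Exact inverse of a 2-by-2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[F(4), F(1), F(2)],
     [F(2), F(3), F(1)],
     [F(6), F(1), F(5)]]
k = 2

# Schur complement from the definition S = A22 - A21 * A11^{-1} * A12
A11_inv = inv2([row[:k] for row in A[:k]])
A12 = [A[0][2], A[1][2]]
A21 = [A[2][0], A[2][1]]
tmp = [sum(A11_inv[i][j] * A12[j] for j in range(k)) for i in range(k)]
S = A[2][2] - sum(A21[i] * tmp[i] for i in range(k))

# Trailing 1-by-1 block after k steps of elimination
trailing = eliminate(A, k)[2][2]
```

Here both `S` and `trailing` equal 2, in agreement with the claim of part 1 for this example.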


QUESTION 2.19. (Medium) Matrix A is called strictly column diagonally 
dominant, or diagonally dominant for short, if 

$$|a_{ii}| > \sum_{j=1,\, j \ne i}^{n} |a_{ji}|.$$

• Show that A is nonsingular. Hint: Use Gershgorin’s theorem. 


• Show that Gaussian elimination with partial pivoting does not actually 
permute any rows, i.e., that it is identical to Gaussian elimination without 
pivoting. Hint: Show that after one step of Gaussian elimination, the 
trailing (n - 1)-by-(n - 1) submatrix, the Schur complement of $a_{11}$ in A, 
is still diagonally dominant. (See Question 2.18 for more discussion of 
the Schur complement.) 


QUESTION 2.20. (Easy; Z. Bai) Given an n-by-n nonsingular matrix A, how 
do you efficiently solve the following problems, using Gaussian elimination with 
partial pivoting? 


(a) Solve the linear system $A^k x = b$, where k is a positive integer. 


(b) Compute $\alpha = c^TA^{-1}b$. 


(c) Solve the matrix equation AX = B, where B is n-by-m. 


You should (1) describe your algorithms, (2) present them in pseudocode (using 
a Matlab-like language; you should not write down the algorithm for GEPP), 
and (3) give the required flops. 


QUESTION 2.21. (Medium) Prove that Strassen’s algorithm (Algorithm 2.8) 
correctly multiplies n-by-n matrices, where n is a power of 2. 


Linear Least Squares Problems 


3.1. Introduction 


Given an m-by-n matrix A and an m-by-1 vector b, the linear least squares 
problem is to find an n-by-1 vector x minimizing $\|Ax - b\|_2$. If m = n and 
A is nonsingular, the answer is simply $x = A^{-1}b$. But if m > n so that we 
have more equations than unknowns, the problem is called overdetermined, 
and generally no x satisfies Ax = b exactly. One occasionally encounters the 
underdetermined problem, where m < n, but we will concentrate on the more 
common overdetermined case. 

This chapter is organized as follows. The rest of this introduction describes 
three applications of least squares problems, to curve fitting, to statistical mod- 
eling of noisy data, and to geodetic modeling. Section 3.2 discusses three stan- 
dard ways to solve the least squares problem: the normal equations, the QR 
decomposition, and the singular value decomposition (SVD). We will frequently 
use the SVD as a tool in later chapters, so we derive several of its properties 
(although algorithms for the SVD are left to Chapter 5). Section 3.3 discusses 
perturbation theory for least squares problems, and section 3.4 discusses the 
implementation details and roundoff error analysis of our main method, QR 
decomposition. The roundoff analysis applies to many algorithms using or- 
thogonal matrices, including many algorithms for eigenvalues and the SVD in 
Chapters 4 and 5. Section 3.5 discusses the particularly ill-conditioned 
situation of rank-deficient least squares problems and how to solve them accurately. 
Section 3.7 and the questions at the end of the chapter give pointers to other 
kinds of least squares problems and to software for sparse problems. 


EXAMPLE 3.1. A typical application of least squares is curve fitting. Suppose 
that we have m pairs of numbers $(y_1, b_1), \ldots, (y_m, b_m)$ and that we want to find 
the “best” cubic polynomial fit to $b_i$ as a function of $y_i$. This means finding 
polynomial coefficients $x_1, \ldots, x_4$ so that the polynomial $p(y) = \sum_{j=1}^{4} x_j y^{j-1}$ 
minimizes the residual $r_i \equiv p(y_i) - b_i$ for i = 1 to m. We can also write this as 
minimizing 

$$r \equiv \begin{bmatrix} r_1 \\ \vdots \\ r_m \end{bmatrix} = \begin{bmatrix} 1 & y_1 & y_1^2 & y_1^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & y_m & y_m^2 & y_m^3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} - \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \equiv A \cdot x - b,$$

where r and b are m-by-1, A is m-by-4, and x is 4-by-1. To minimize r, 
we could choose any norm, such as $\|r\|_\infty$, $\|r\|_1$, or $\|r\|_2$. The last one, which 
corresponds to minimizing the sum of the squared residuals $\sum_{i=1}^{m} r_i^2$, is a linear 
least squares problem. 

Figure 3.1 shows an example, where we fit polynomials of increasing degree 
to the smooth function $b = \sin(\pi y/5) + y/5$ at the 23 points $y = -5, -4.5, -4, 
\ldots, 5.5, 6$. The left side of Figure 3.1 plots the data points as circles, and four 
different approximating polynomials of degrees 1, 3, 6, and 19. The right side 
of Figure 3.1 plots the residual norm ||r||2 versus degree for degrees from 1 to 
20. Note that as the degree increases from 1 to 17, the residual norm decreases. 
We expect this behavior, since increasing the polynomial degree should let us 
fit the data better. 

But when we reach degree 18, the residual norm suddenly increases dra- 
matically. We can see how erratic the plot of the degree 19 polynomial is on 
the left (the blue line). This is due to ill-conditioning, as we will later see. 
Typically, one does polynomial fitting only with relatively low degree poly- 
nomials, avoiding ill-conditioning [60]. Polynomial fitting is available as the 
function polyfit in Matlab. 

Here is an alternative to polynomial fitting. More generally, one has a set 
of independent functions $f_1(y), \ldots, f_n(y)$ from $\mathbb{R}^k$ to $\mathbb{R}$ and a set of points 
$(y_1, b_1), \ldots, (y_m, b_m)$ with $y_i \in \mathbb{R}^k$ and $b_i \in \mathbb{R}$, and one wishes to find a best 
fit to these points of the form $b = \sum_{j=1}^{n} x_j f_j(y)$. In other words one wants 
to choose $x = [x_1, \ldots, x_n]^T$ to minimize the residuals $r_i \equiv \sum_{j=1}^{n} x_j f_j(y_i) - b_i$ 
for $1 \le i \le m$. Letting $a_{ij} = f_j(y_i)$, we can write this as r = Ax - b, where 
A is m-by-n, x is n-by-1, and b and r are m-by-1. A good choice of basis 
functions $f_j(y)$ can lead to better fits and less ill-conditioned systems than 
using polynomials [33, 82, 166]. ◊ 
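
The construction of A from a set of basis functions can be sketched in a few lines. The code below is an illustration only: the helper names `design_matrix` and `residual` are invented here, the basis is the cubic monomial basis of this example, and the sample points are the 23 points used for Figure 3.1. It builds the matrix with entries $f_j(y_i)$ and evaluates r = Ax - b for a trial coefficient vector; it does not solve the least squares problem.

```python
import math


def design_matrix(basis, ys):
    """Return A with A[i][j] = f_j(y_i) for the given basis functions."""
    return [[f(y) for f in basis] for y in ys]


def residual(A, x, b):
    """Return the residual vector r = A x - b."""
    return [sum(aij * xj for aij, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]


# Cubic monomial basis 1, y, y^2, y^3 (any independent functions would do).
basis = [lambda y, p=p: y ** p for p in range(4)]

# The 23 sample points -5, -4.5, ..., 6 and the smooth function of the text.
ys = [-5 + 0.5 * i for i in range(23)]
b = [math.sin(math.pi * y / 5) + y / 5 for y in ys]

A = design_matrix(basis, ys)
r = residual(A, [0.0, 1.0, 0.0, 0.0], b)     # trial fit p(y) = y

# Sanity check: data generated by 1 + 2y is reproduced with zero residual.
r_exact = residual(A, [1.0, 2.0, 0.0, 0.0], [1 + 2 * y for y in ys])
```

Swapping in a different list of functions for `basis` changes A without touching the rest of the code, which is the point of the more general formulation.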


Fig. 3.1. Polynomial fit to curve $b = \sin(\pi y/5) + y/5$, and residual norms. 


EXAMPLE 3.2. In statistical modeling, one often wishes to estimate certain 
parameters $x_j$ based on some observations, where the observations are 
contaminated by noise. For example, suppose that one wishes to predict the 
college grade point average (GPA) (b) of freshman applicants based on their 
high school GPA ($a_1$) and two Scholastic Aptitude Test scores, verbal ($a_2$) and 
quantitative ($a_3$), as part of the college admissions process. Based on past 
data from admitted freshmen one can construct a linear model of the form 
$b = \sum_{j=1}^{3} a_j x_j$. The observations are $a_{i1}$, $a_{i2}$, $a_{i3}$, and $b_i$, one set for each of 
the m students in the database. Thus, one wants to minimize 


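$$r \equiv \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ \vdots & \vdots & \vdots \\ a_{m1} & a_{m2} & a_{m3} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} - \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} \equiv A \cdot x - b,$$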


which we can do as a least squares problem. 

Here is a statistical justification for least squares, which is called linear 
regression by statisticians: assume that the $a_{ij}$ are known exactly so that only 
b has noise in it, and that the noise in each $b_i$ is independent and normally 
distributed with 0 mean and the same standard deviation $\sigma$. Let x be the 
solution of the least squares problem and $x_T$ be the true value of the parameters. 
Then x is called a maximum-likelihood estimate of $x_T$, and the error $x - x_T$ is 
normally distributed, with zero mean in each component and covariance 
matrix $\sigma^2(A^TA)^{-1}$. We will see the matrix $(A^TA)^{-1}$ again below when we solve 
the least squares problem using the normal equations. For more details on the 
connection to statistics,$^{14}$ see, for example, [33, 257]. ◊ 


$^{14}$The standard notation in statistics differs from linear algebra: statisticians write $X\beta = y$ 
instead of $Ax = b$. 


EXAMPLE 3.3. The least squares problem was first posed and formulated by 
Gauss to solve a practical problem for the German government. There are 
important economic and legal reasons to know exactly where the boundaries 
lie between plots of land owned by different people. Surveyors would go out and 
try to establish these boundaries, measuring certain angles and distances and 
then triangulating from known landmarks. As population density increased, 
it became necessary to improve the accuracy to which the locations of the 
landmarks were known. So the surveyors of the day went out and remeasured 
many angles and distances between landmarks, and it fell to Gauss to figure 
out how to take these more accurate measurements and update the government 
database of locations. For this he invented least squares, as we will explain 
shortly [33]. 

The problem that Gauss solved did not go away and must be periodically 
revisited. In 1974 the US National Geodetic Survey undertook to update the 
US geodetic database, which consisted of about 700,000 points. The motiva- 
tions had grown to include supplying accurate enough data for civil engineers 
and regional planners to plan construction projects and for geophysicists to 
study the motion of tectonic plates in the earth’s crust (which can move up to 
5 cm per year). The corresponding least squares problem was the largest ever 
solved at the time: about 2.5 million equations in 400,000 unknowns. It was 
also very sparse, which made it tractable on the computers available in 1978, 
when the computation was done [162]. 

Now we briefly discuss the formulation of this problem. It is actually 
nonlinear and solved by approximating it by a sequence of linear ones, each of 
which is a linear least squares problem. The database consists of a list of 
points (landmarks), each labeled by location: latitude, longitude, and possibly 
elevation. For simplicity of exposition, we assume that the earth is flat and 
suppose that each point i is labeled by linear coordinates $z_i = (x_i, y_i)^T$. For 
each point we wish to compute a correction $\delta z_i = (\delta x_i, \delta y_i)^T$ so that the 
corrected location $z_i' = (x_i', y_i')^T = z_i + \delta z_i$ more nearly matches the new, more 
accurate measurements. These measurements include both distances between 
selected pairs of points and angles between the line segment from point i to 
j and i to k (see Figure 3.2). To see how to turn these new measurements 
into constraints, consider the triangle in Figure 3.2. The corners are labeled 
by their (corrected) locations, and the angles $\theta$ and edge lengths L are also 
shown. From this data, it is easy to write down constraints based on simple 
trigonometric identities. For example, an accurate measurement of $\theta_i$ leads to 
the constraint 


$$\cos^2\theta_i = \frac{[(z_j' - z_i')^T (z_k' - z_i')]^2}{(z_j' - z_i')^T (z_j' - z_i') \cdot (z_k' - z_i')^T (z_k' - z_i')},$$

where we have expressed $\cos\theta_i$ in terms of dot products of certain sides of 
the triangle. If we assume that $\delta z_i$ is small compared to $z_i$, then we can 
linearize this constraint as follows: multiply through by the denominator of 
the fraction, multiply out all the terms to get a quartic polynomial in all the 
“δ-variables” (like $\delta x_i$), and throw away all terms containing more than one 
δ-variable as a factor. This yields an equation in which all δ-variables appear 
linearly. If we collect all these linear constraints from all the new angle and 
distance measurements together, we get an overdetermined linear system of 
equations for all the δ-variables. We wish to find the smallest corrections, i.e., 
the smallest values of $\delta x_i$, etc., that most nearly satisfy these constraints. This 
is a least squares problem. ◊ 


Fig. 3.2. Constraints in updating a geodetic database. 


Later, after we introduce more machinery, we will also show how image 
compression can be interpreted as a least squares problem (see Example 3.4). 


3.2. Matrix Factorizations That Solve the Linear Least 
Squares Problem 


The linear least squares problem has several explicit solutions that we now 
discuss: 


1. normal equations, 

2. QR decomposition, 

3. SVD, 

4. transformation to a linear system (see Question 3.3). 


The first method is the fastest but least accurate; it is adequate when the 
condition number is small. The second method is the standard one and costs 
up to twice as much as the first method. The third method is of most use on an 
ill-conditioned problem, i.e., when A is not of full rank; it is several times more 
expensive again. The last method lets us do iterative refinement and improve 
the solution when the problem is ill-conditioned. All methods but the third 
can be adapted to deal efficiently with sparse matrices [33]. We will discuss 
each solution in turn. We assume initially for methods 1 and 2 that A has full 
column rank n. 


3.2.1. Normal Equations 


To derive the normal equations, we look for the x where the gradient of 
$\|Ax - b\|_2^2 = (Ax - b)^T(Ax - b)$ vanishes. So we want 

$$\begin{aligned} 0 &= \lim_{e \to 0} \frac{(A(x + e) - b)^T(A(x + e) - b) - (Ax - b)^T(Ax - b)}{\|e\|_2} \\ &= \lim_{e \to 0} \frac{2e^T(A^TAx - A^Tb) + e^TA^TAe}{\|e\|_2}. \end{aligned}$$

The second term satisfies $\frac{|e^TA^TAe|}{\|e\|_2} \le \frac{\|A\|_2^2 \|e\|_2^2}{\|e\|_2} = \|A\|_2^2 \|e\|_2$, which approaches 0 as e goes to 
0, so the factor $A^TAx - A^Tb$ in the first term must also be zero, or $A^TAx = A^Tb$. 
This is a system of n linear equations in n unknowns, the normal equations. 

Why is $x = (A^TA)^{-1}A^Tb$ the minimizer of $\|Ax - b\|_2^2$? We can note that 
the Hessian $A^TA$ is positive definite, which means that the function is strictly 
convex and any critical point is a global minimum. Or we can complete the 
square by writing $x' = x + y$ and simplifying 

$$\begin{aligned} (Ax' - b)^T(Ax' - b) &= (Ay + Ax - b)^T(Ay + Ax - b) \\ &= (Ay)^T(Ay) + (Ax - b)^T(Ax - b) + 2(Ay)^T(Ax - b) \\ &= \|Ay\|_2^2 + \|Ax - b\|_2^2 + 2y^T(A^TAx - A^Tb) \\ &= \|Ay\|_2^2 + \|Ax - b\|_2^2. \end{aligned}$$

This is clearly minimized by y = 0. This is just the Pythagorean theorem, since 
the residual r = Ax - b is orthogonal to the space spanned by the columns of 
A, i.e., $0 = A^Tr = A^TAx - A^Tb$, as illustrated below (the plane shown is the 
span of the column vectors of A so that Ax, Ay, and $Ax' = A(x + y)$ all lie in 
the plane): 

[Figure: the vectors $r = Ax - b$, $Ax' - b = A(x + y) - b$, $Ax' = A(x + y)$, and $Ay$.] 


Since $A^TA$ is symmetric and positive definite, we can use the Cholesky 
decomposition to solve the normal equations. The total cost of computing 
$A^TA$, $A^Tb$, and the Cholesky decomposition is $n^2m + \frac{1}{3}n^3 + O(n^2)$ flops. Since 
$m \ge n$, the $n^2m$ cost of forming $A^TA$ dominates the cost. 
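
The steps just described (form $A^TA$ and $A^Tb$, factor by Cholesky, then do two triangular solves) can be sketched as follows. This is a minimal illustration in plain Python under the full-column-rank assumption; the function name `normal_equations_solve` and the 3-by-2 test problem are assumptions made here, and production code would call LAPACK instead.

```python
import math


def normal_equations_solve(A, b):
    """Least squares via the normal equations and a Cholesky factorization.

    A is a list of m rows of length n with full column rank; b has length m.
    """
    m, n = len(A), len(A[0])
    # C = A^T A (n-by-n) and d = A^T b (length n)
    C = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    d = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    # Cholesky factorization C = L L^T with L lower triangular
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(C[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (C[i][j]
                       - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    # Forward substitution: L w = d
    w = [0.0] * n
    for i in range(n):
        w[i] = (d[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T x = w
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (w[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


# Fit a straight line x1 + x2*y to the points (0,1), (1,2), (2,2).
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]
x = normal_equations_solve(A, b)
```

For this small problem the normal equations are $3x_1 + 3x_2 = 5$ and $3x_1 + 5x_2 = 6$, so the exact solution is $x = (7/6, 1/2)$.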


3.2.2. QR Decomposition 


THEOREM 3.1. QR decomposition. Let A be m-by-n with $m \ge n$. Suppose 
that A has full column rank. Then there exists a unique m-by-n orthogonal 
matrix Q ($Q^TQ = I_n$) and a unique n-by-n upper triangular matrix R with 
positive diagonals $r_{ii} > 0$ such that A = QR. 


Proof. We give two proofs of this theorem. First, this theorem is a restatement 
of the Gram-Schmidt orthogonalization process [137]. If we apply Gram-Schmidt 
to the columns $a_i$ of $A = [a_1, a_2, \ldots, a_n]$ from left to right, we get 
a sequence of orthonormal vectors $q_1$ through $q_n$ spanning the same space: 
these orthogonal vectors are the columns of Q. Gram-Schmidt also computes 
coefficients $r_{ji} = q_j^T a_i$ expressing each column $a_i$ as a linear combination of $q_1$ 
through $q_i$: $a_i = \sum_{j=1}^{i} r_{ji} q_j$. The $r_{ji}$ are just the entries of R. 


ALGORITHM 3.1. The classical Gram-Schmidt (CGS) and modified Gram-Schmidt 
(MGS) Algorithms for factoring A = QR: 


    for i = 1 to n /* compute ith columns of Q and R */ 
        $q_i = a_i$ 
        for j = 1 to i - 1 /* subtract components in $q_j$ direction from $a_i$ */ 
            $r_{ji} = q_j^T a_i$    (CGS) 
            $r_{ji} = q_j^T q_i$    (MGS) 
            $q_i = q_i - r_{ji} q_j$ 
        end for 
        $r_{ii} = \|q_i\|_2$ 
        if $r_{ii} = 0$ /* $a_i$ is linearly dependent on $a_1, \ldots, a_{i-1}$ */ 
            quit 
        end if 
        $q_i = q_i / r_{ii}$ 
    end for 


We leave it as an exercise to show that the two formulas for $r_{ji}$ in the 
algorithm are mathematically equivalent (see Question 3.1). If A has full column 
rank, $r_{ii}$ will not be zero. The following figure illustrates Gram-Schmidt when 
A is 2-by-2: 


The second proof of this theorem will use Algorithm 3.2, which we present 
in section 3.4.1. 

Unfortunately, CGS is numerically unstable in floating point arithmetic 
when the columns of A are nearly linearly dependent. MGS is more stable and 
will be used in algorithms later in this book but may still result in Q being far 
from orthogonal ($\|Q^TQ - I\|$ being far larger than $\varepsilon$) when A is ill-conditioned 
[31, 32, 33, 147]. Algorithm 3.2 in section 3.4.1 is a stable alternative algorithm 
for factoring A = QR. See Question 3.2. 
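
The difference between CGS and MGS can be observed on a small example with nearly dependent columns. The sketch below implements both variants of Algorithm 3.1 in plain Python, storing Q as a list of columns. The test matrix (three columns that agree except for one entry of size $10^{-8}$ each), the helper names, and the use of the largest entry of $|Q^TQ - I|$ as the error measure are choices made for this illustration.

```python
import math


def gram_schmidt(cols, modified):
    """Factor A = QR by CGS (modified=False) or MGS (modified=True).

    cols is the list of columns of A; the columns of Q and the
    n-by-n upper triangular R are returned.
    """
    n = len(cols)
    Q = []
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q = cols[i][:]
        for j in range(i):
            # CGS projects the original column a_i; MGS the updated q_i.
            against = q if modified else cols[i]
            R[j][i] = sum(Q[j][k] * against[k] for k in range(len(q)))
            q = [q[k] - R[j][i] * Q[j][k] for k in range(len(q))]
        R[i][i] = math.sqrt(sum(v * v for v in q))
        if R[i][i] == 0.0:
            raise ValueError("columns are linearly dependent")
        Q.append([v / R[i][i] for v in q])
    return Q, R


def orthogonality_error(Q):
    """Largest entry in absolute value of Q^T Q - I."""
    n = len(Q)
    return max(abs(sum(a * b for a, b in zip(Q[i], Q[j]))
                   - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


eps = 1e-8
cols = [[1.0, eps, 0.0, 0.0],
        [1.0, 0.0, eps, 0.0],
        [1.0, 0.0, 0.0, eps]]

cgs_err = orthogonality_error(gram_schmidt(cols, modified=False)[0])
mgs_err = orthogonality_error(gram_schmidt(cols, modified=True)[0])
```

In IEEE double precision the CGS result on this matrix has two computed columns whose inner product is about 0.5, while the MGS error is of order $10^{-8}$. The MGS error is still far above machine precision, consistent with the remark that MGS can also lose orthogonality when A is ill-conditioned.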

We will derive the formula for the x that minimizes $\|Ax - b\|_2$ using the 
decomposition A = QR in three slightly different ways. First, we can always 
choose $m - n$ more orthonormal vectors $\tilde{Q}$ so that $[Q, \tilde{Q}]$ is a square orthogonal 
matrix (for example, we can choose any $m - n$ more independent vectors $\tilde{X}$ 
that we want and then apply Algorithm 3.1 to the m-by-m nonsingular matrix 
$[Q, \tilde{X}]$). Then 


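$$\begin{aligned} \|Ax - b\|_2^2 &= \|[Q, \tilde{Q}]^T(Ax - b)\|_2^2 \quad \text{by part 4 of Lemma 1.7} \\ &= \left\| \begin{bmatrix} Q^T \\ \tilde{Q}^T \end{bmatrix} (QRx - b) \right\|_2^2 \\ &= \left\| \begin{bmatrix} I^{n \times n} \\ O^{(m-n) \times n} \end{bmatrix} Rx - \begin{bmatrix} Q^Tb \\ \tilde{Q}^Tb \end{bmatrix} \right\|_2^2 \\ &= \left\| \begin{bmatrix} Rx - Q^Tb \\ -\tilde{Q}^Tb \end{bmatrix} \right\|_2^2 \\ &= \|Rx - Q^Tb\|_2^2 + \|\tilde{Q}^Tb\|_2^2 \\ &\ge \|\tilde{Q}^Tb\|_2^2. \end{aligned}$$

We can solve $Rx - Q^Tb = 0$ for x, since A and R have the same rank, 
n, and so R is nonsingular. Then $x = R^{-1}Q^Tb$, and the minimum value of 
$\|Ax - b\|_2$ is $\|\tilde{Q}^Tb\|_2$. 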

Here is a second, slightly different derivation that does not use the matrix 
$\tilde{Q}$. Rewrite Ax - b as 

$$\begin{aligned} Ax - b &= QRx - b = QRx - (QQ^T + I - QQ^T)b \\ &= Q(Rx - Q^Tb) - (I - QQ^T)b. \end{aligned}$$

Note that the vectors $Q(Rx - Q^Tb)$ and $(I - QQ^T)b$ are orthogonal, because 
$(Q(Rx - Q^Tb))^T((I - QQ^T)b) = (Rx - Q^Tb)^T[Q^T(I - QQ^T)]b = (Rx - Q^Tb)^T[0]b = 0$. 
Therefore, by the Pythagorean theorem, 

$$\begin{aligned} \|Ax - b\|_2^2 &= \|Q(Rx - Q^Tb)\|_2^2 + \|(I - QQ^T)b\|_2^2 \\ &= \|Rx - Q^Tb\|_2^2 + \|(I - QQ^T)b\|_2^2, \end{aligned}$$

where we have used part 4 of Lemma 1.7 in the form $\|Qy\|_2^2 = \|y\|_2^2$. This sum 
of squares is minimized when the first term is zero, i.e., $x = R^{-1}Q^Tb$. 

Finally, here is a third derivation that starts from the normal equations 
solution: 

$$\begin{aligned} x &= (A^TA)^{-1}A^Tb \\ &= (R^TQ^TQR)^{-1}R^TQ^Tb = (R^TR)^{-1}R^TQ^Tb \\ &= R^{-1}R^{-T}R^TQ^Tb = R^{-1}Q^Tb. \end{aligned}$$

Later we will show that the cost of this decomposition and subsequent least 
squares solution is $2n^2m - \frac{2}{3}n^3$, about twice the cost of the normal equations 
if $m \gg n$ and about the same if m = n. 
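
All three derivations lead to the same recipe: factor A = QR, form $Q^Tb$, and solve the triangular system $Rx = Q^Tb$. A minimal sketch follows. It computes the factorization by MGS from Algorithm 3.1 for simplicity (not by the stable Algorithm 3.2 that the text recommends), assumes full column rank, and uses illustrative names `mgs_qr` and `qr_least_squares` and a small made-up test problem.

```python
import math


def mgs_qr(A):
    """Thin QR factorization of an m-by-n list-of-rows matrix by MGS."""
    m, n = len(A), len(A[0])
    Q = [[0.0] * n for _ in range(m)]
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q = [A[k][i] for k in range(m)]
        for j in range(i):
            R[j][i] = sum(Q[k][j] * q[k] for k in range(m))
            q = [q[k] - R[j][i] * Q[k][j] for k in range(m)]
        R[i][i] = math.sqrt(sum(v * v for v in q))
        for k in range(m):
            Q[k][i] = q[k] / R[i][i]
    return Q, R


def qr_least_squares(A, b):
    """Minimize ||Ax - b||_2 by solving R x = Q^T b with back substitution."""
    Q, R = mgs_qr(A)
    m, n = len(A), len(A[0])
    c = [sum(Q[k][i] * b[k] for k in range(m)) for i in range(n)]   # Q^T b
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (c[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x


# Fit a straight line x1 + x2*y to the points (0,1), (1,2), (2,2).
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]
x = qr_least_squares(A, b)
```

For this problem the exact minimizer is $x = (7/6, 1/2)$, which is also what the normal equations $A^TAx = A^Tb$ give.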


3.2.3. Singular Value Decomposition 


The SVD is a very important decomposition which is used for many purposes 
other than solving least squares problems. 


THEOREM 3.2. SVD. Let A be an arbitrary m-by-n matrix with $m \ge n$. Then 
we can write $A = U\Sigma V^T$, where U is m-by-n and satisfies $U^TU = I$, V is n-by-n 
and satisfies $V^TV = I$, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, where $\sigma_1 \ge \cdots \ge \sigma_n \ge 0$. 
The columns $u_1, \ldots, u_n$ of U are called left singular vectors. The columns 
$v_1, \ldots, v_n$ of V are called right singular vectors. The $\sigma_i$ are called singular 
values. (If m < n, the SVD is defined by considering $A^T$.) 


A geometric restatement of this theorem is as follows. Given any m-by-n 
matrix A, think of it as mapping a vector $x \in \mathbb{R}^n$ to a vector $y = Ax \in \mathbb{R}^m$. 
Then we can choose one orthogonal coordinate system for $\mathbb{R}^n$ (where the unit 
axes are the columns of V) and another orthogonal coordinate system for $\mathbb{R}^m$ 
(where the unit axes are the columns of U) such that A is diagonal ($\Sigma$), i.e., 
maps a vector $x = \sum_{i=1}^{n} \beta_i v_i$ to $y = Ax = \sum_{i=1}^{n} \sigma_i \beta_i u_i$. In other words, 
any matrix is diagonal, provided we pick appropriate orthogonal coordinate 
systems for its domain and range. 
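
A small worked example may help make this concrete. The 2-by-2 matrix below is chosen here for illustration and does not come from the text; its factors can be verified by multiplying them out.

$$A = \begin{bmatrix} 3 & 0 \\ 4 & 5 \end{bmatrix} = \underbrace{\frac{1}{\sqrt{10}} \begin{bmatrix} 1 & 3 \\ 3 & -1 \end{bmatrix}}_{U} \; \underbrace{\begin{bmatrix} 3\sqrt{5} & 0 \\ 0 & \sqrt{5} \end{bmatrix}}_{\Sigma} \; \underbrace{\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}}_{V^T}$$

Here $v_1 = (1, 1)^T/\sqrt{2}$ and $v_2 = (1, -1)^T/\sqrt{2}$, and indeed

$$Av_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 3 \\ 9 \end{bmatrix} = 3\sqrt{5} \cdot \frac{1}{\sqrt{10}} \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \sigma_1 u_1, \qquad Av_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 3 \\ -1 \end{bmatrix} = \sqrt{5} \cdot \frac{1}{\sqrt{10}} \begin{bmatrix} 3 \\ -1 \end{bmatrix} = \sigma_2 u_2,$$

so in the coordinate systems given by the columns of V and U the map is simply a scaling by $\sigma_1 = 3\sqrt{5}$ and $\sigma_2 = \sqrt{5}$.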

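Proof of Theorem 3.2. We use induction on m and n: we assume that the 
SVD exists for (m - 1)-by-(n - 1) matrices and prove it for m-by-n. We assume 
$A \ne 0$; otherwise we can take $\Sigma = 0$ and let U and V be arbitrary orthogonal 
matrices. 

The basic step occurs when n = 1 (since $m \ge n$). We write $A = U\Sigma V^T$ 
with $U = A/\|A\|_2$, $\Sigma = \|A\|_2$, and V = 1. 

For the induction step, choose v so $\|v\|_2 = 1$ and $\|A\|_2 = \|Av\|_2 > 0$. Such 
a v exists by the definition of $\|A\|_2 = \max_{\|v\|_2 = 1} \|Av\|_2$. Let $u = \frac{Av}{\|Av\|_2}$, which 
is a unit vector. Choose $\tilde{U}$ and $\tilde{V}$ so that $\hat{U} = [u, \tilde{U}]$ is an m-by-n orthogonal 
matrix, and $\hat{V} = [v, \tilde{V}]$ is an n-by-n orthogonal matrix. Now write 

$$\hat{U}^T A \hat{V} = \begin{bmatrix} u^T \\ \tilde{U}^T \end{bmatrix} \cdot A \cdot [v, \tilde{V}] = \begin{bmatrix} u^TAv & u^TA\tilde{V} \\ \tilde{U}^TAv & \tilde{U}^TA\tilde{V} \end{bmatrix}.$$

Then 

$$u^TAv = \frac{(Av)^T(Av)}{\|Av\|_2} = \frac{\|Av\|_2^2}{\|Av\|_2} = \|Av\|_2 = \|A\|_2 \equiv \sigma$$

and $\tilde{U}^TAv = \tilde{U}^Tu\|Av\|_2 = 0$. We claim $u^TA\tilde{V} = 0$ too because otherwise 
$\sigma = \|A\|_2 = \|\hat{U}^TA\hat{V}\|_2 \ge \|[1, 0, \ldots, 0]\hat{U}^TA\hat{V}\|_2 = \|[\sigma \,|\, u^TA\tilde{V}]\|_2 > \sigma$, a 
contradiction. (We have used part 7 of Lemma 1.7.) 

So $\hat{U}^TA\hat{V} = \begin{bmatrix} \sigma & 0 \\ 0 & \tilde{U}^TA\tilde{V} \end{bmatrix} = \begin{bmatrix} \sigma & 0 \\ 0 & \tilde{A} \end{bmatrix}$. We may now apply the induction 
hypothesis to $\tilde{A}$ to get $\tilde{A} = U_1\Sigma_1V_1^T$, where $U_1$ is (m - 1)-by-(n - 1), $\Sigma_1$ is 
(n - 1)-by-(n - 1), and $V_1$ is (n - 1)-by-(n - 1). So 

$$\hat{U}^TA\hat{V} = \begin{bmatrix} \sigma & 0 \\ 0 & U_1\Sigma_1V_1^T \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & U_1 \end{bmatrix} \begin{bmatrix} \sigma & 0 \\ 0 & \Sigma_1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & V_1 \end{bmatrix}^T$$

or 

$$A = \left( \hat{U} \begin{bmatrix} 1 & 0 \\ 0 & U_1 \end{bmatrix} \right) \begin{bmatrix} \sigma & 0 \\ 0 & \Sigma_1 \end{bmatrix} \left( \hat{V} \begin{bmatrix} 1 & 0 \\ 0 & V_1 \end{bmatrix} \right)^T,$$

which is our desired decomposition. 

The SVD has a large number of important algebraic and geometric 
properties, the most important of which we state here. 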


THEOREM 3.3. Let $A = U\Sigma V^T$ be the SVD of the m-by-n matrix A, where 
$m \ge n$. (There are analogous results for m < n.) 


1. Suppose that A is symmetric, with eigenvalues $\lambda_i$ and orthonormal 
eigenvectors $u_i$. In other words $A = U\Lambda U^T$ is an eigendecomposition of A, 
with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, $U = [u_1, \ldots, u_n]$, and $UU^T = I$. Then an 
SVD of A is $A = U\Sigma V^T$, where $\sigma_i = |\lambda_i|$ and $v_i = \mathrm{sign}(\lambda_i)u_i$, where 
sign(0) = 1. 


2. The eigenvalues of the symmetric matrix $A^TA$ are $\sigma_i^2$. The right singular 
vectors $v_i$ are corresponding orthonormal eigenvectors. 


3. The eigenvalues of the symmetric matrix $AA^T$ are $\sigma_i^2$ and $m - n$ zeroes. 
The left singular vectors $u_i$ are corresponding orthonormal eigenvectors 
for the eigenvalues $\sigma_i^2$. One can take any $m - n$ other orthogonal vectors 
as eigenvectors for the eigenvalue 0. 


4. Let $H = \begin{bmatrix} 0 & A^T \\ A & 0 \end{bmatrix}$, where A is square and $A = U\Sigma V^T$ is the SVD of A. 
Let $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, $U = [u_1, \ldots, u_n]$, and $V = [v_1, \ldots, v_n]$. Then 
the 2n eigenvalues of H are $\pm\sigma_i$, with corresponding unit eigenvectors 
$\frac{1}{\sqrt{2}} \begin{bmatrix} v_i \\ \pm u_i \end{bmatrix}$. 


5. If A has full rank, the solution of $\min_x \|Ax - b\|_2$ is $x = V\Sigma^{-1}U^Tb$. 


6. $\|A\|_2 = \sigma_1$. If A is also square and nonsingular, then $\|A^{-1}\|_2^{-1} = \sigma_n$ and 
$\|A\|_2 \cdot \|A^{-1}\|_2 = \frac{\sigma_1}{\sigma_n}$. 


7. Suppose $\sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_n = 0$. Then the rank of A is r. 
The null space of A, i.e., the subspace of vectors v such that Av = 0, is 
the space spanned by columns r + 1 through n of V: $\mathrm{span}(v_{r+1}, \ldots, v_n)$. 
The range space of A, the subspace of vectors of the form Aw for all w, 
is the space spanned by columns 1 through r of U: $\mathrm{span}(u_1, \ldots, u_r)$. 


8. Let $S^{n-1}$ be the unit sphere in $\mathbb{R}^n$: $S^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$. 
Let $A \cdot S^{n-1}$ be the image of $S^{n-1}$ under A: $A \cdot S^{n-1} = \{Ax : x \in \mathbb{R}^n \text{ and } \|x\|_2 = 1\}$. 
Then $A \cdot S^{n-1}$ is an ellipsoid centered at the origin 
of $\mathbb{R}^m$, with principal axes $\sigma_i u_i$. 


9. Write $V = [v_1, v_2, \ldots, v_n]$ and $U = [u_1, u_2, \ldots, u_n]$, so $A = U\Sigma V^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T$ 
(a sum of rank-1 matrices). Then a matrix of rank k < n 
closest to A (measured with $\|\cdot\|_2$) is $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, and $\|A - A_k\|_2 = \sigma_{k+1}$. 
We may also write $A_k = U\Sigma_k V^T$, where $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)$. 


Proof.

1. This is true by the definition of the SVD.

2. $A^T A = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$. This is an eigendecomposition of $A^T A$,
with the columns of V the eigenvectors and the diagonal entries of $\Sigma^2$
the eigenvalues.

3. Choose an m-by-(m − n) matrix $\tilde U$ so that $[U, \tilde U]$ is square and orthogonal.
Then write

$$A A^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T = [U, \tilde U] \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix} [U, \tilde U]^T.$$

This is an eigendecomposition of $A A^T$.

4. See Question 3.14.

5. $\|Ax - b\|_2^2 = \|U \Sigma V^T x - b\|_2^2$. Since A has full rank, so does $\Sigma$, and thus
$\Sigma$ is invertible. Now let $[U, \tilde U]$ be square and orthogonal as above so

$$\|U \Sigma V^T x - b\|_2^2 = \left\| \begin{bmatrix} U^T \\ \tilde U^T \end{bmatrix} (U \Sigma V^T x - b) \right\|_2^2 = \left\| \begin{bmatrix} \Sigma V^T x - U^T b \\ -\tilde U^T b \end{bmatrix} \right\|_2^2 = \|\Sigma V^T x - U^T b\|_2^2 + \|\tilde U^T b\|_2^2.$$

This is minimized by making the first term zero, i.e., $x = V \Sigma^{-1} U^T b$.

6. It is clear from its definition that the two-norm of a diagonal matrix is
the largest absolute entry on its diagonal. Thus, by part 3 of Lemma 1.7,
$\|A\|_2 = \|U^T A V\|_2 = \|\Sigma\|_2 = \sigma_1$ and $\|A^{-1}\|_2 = \|V^T A^{-1} U\|_2 = \|\Sigma^{-1}\|_2 = \sigma_n^{-1}$.


7. Again choose an m-by-(m − n) matrix $\tilde U$ so that the m-by-m matrix
$\hat U = [U, \tilde U]$ is orthogonal. Since $\hat U$ and V are nonsingular, A and
$\hat U^T A V = \begin{bmatrix} \Sigma^{n \times n} \\ 0^{(m-n) \times n} \end{bmatrix} \equiv \hat\Sigma$ have the same rank, namely r, by our assumption
about $\Sigma$. Also, v is in the null space of A if and only if $V^T v$ is in the null
space of $\hat U^T A V = \hat\Sigma$, since Av = 0 if and only if $\hat U^T A V (V^T v) = 0$. But
the null space of $\hat\Sigma$ is clearly spanned by columns r + 1 through n of the
n-by-n identity matrix $I_n$, so the null space of A is spanned by V times
these columns, i.e., $v_{r+1}$ through $v_n$. A similar argument shows that the
range space of A is the same as $\hat U$ times the range space of $\hat U^T A V = \hat\Sigma$,
i.e., $\hat U$ times the first r columns of $I_m$, or $u_1$ through $u_r$.


8. We "build" the set $A \cdot S^{n-1}$ by multiplying by one factor of $A = U \Sigma V^T$
at a time. The figure below illustrates what happens when

$$A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 2^{-1/2} & -2^{-1/2} \\ 2^{-1/2} & 2^{-1/2} \end{bmatrix} \cdot \begin{bmatrix} 4 & 0 \\ 0 & 2 \end{bmatrix} \cdot \begin{bmatrix} 2^{-1/2} & -2^{-1/2} \\ 2^{-1/2} & 2^{-1/2} \end{bmatrix}^T \equiv U \Sigma V^T.$$

Assume for simplicity that A is square and nonsingular. Since V is
orthogonal and so maps unit vectors to other unit vectors, $V^T \cdot S^{n-1} = S^{n-1}$.
Next, since $v \in S^{n-1}$ if and only if $\|v\|_2 = 1$, $w \in \Sigma S^{n-1}$ if and
only if $\|\Sigma^{-1} w\|_2 = 1$ or $\sum_{i=1}^n (w_i / \sigma_i)^2 = 1$. This defines an ellipsoid with
principal axes $\sigma_i e_i$, where $e_i$ is the ith column of the identity matrix.
Finally, multiplying each $w = \Sigma v$ by U just rotates the ellipse so that
each $e_i$ becomes $u_i$, the ith column of U.

[Figure: four panels, titled S, V'*S, Sigma*V'*S, and U*Sigma*V'*S, each drawn on axes running from −4 to 4.]


9. $A_k$ has rank k by construction and

$$\|A - A_k\|_2 = \left\| \sum_{i=k+1}^n \sigma_i u_i v_i^T \right\|_2 = \left\| U \, \mathrm{diag}(0, \ldots, 0, \sigma_{k+1}, \ldots, \sigma_n) \, V^T \right\|_2 = \sigma_{k+1}.$$

It remains to show that there is no closer rank k matrix to A. Let B
be any rank k matrix, so its null space has dimension n − k. The space
spanned by $\{v_1, \ldots, v_{k+1}\}$ has dimension k + 1. Since the sum of their
dimensions is (n − k) + (k + 1) > n, these two spaces must overlap. Let
h be a unit vector in their intersection. Then

$$\|A - B\|_2^2 \ge \|(A - B) h\|_2^2 = \|A h\|_2^2 = \|U \Sigma V^T h\|_2^2 = \|\Sigma (V^T h)\|_2^2 \ge \sigma_{k+1}^2 \|V^T h\|_2^2 = \sigma_{k+1}^2.$$
EXAMPLE 3.4. We illustrate the last part of Theorem 3.3 by using it for image
compression. In particular, we will illustrate it with low-rank approximations
of a clown. An m-by-n image is just an m-by-n matrix, where entry (i, j) is
interpreted as the brightness of pixel (i, j). In other words, matrix entries ranging
from 0 to 1 (say) are interpreted as pixels ranging from black (=0) through
various shades of gray to white (=1). (Colors also are possible.) Rather than
storing or transmitting all $m \cdot n$ matrix entries to represent the image, we often
prefer to compress the image by storing many fewer numbers, from which we
can still approximately reconstruct the original image. We may use Part 9 of
Theorem 3.3 to do this, as we now illustrate.

Consider the image at the top left of Figure 3.3. This 320-by-200 pixel
image corresponds to a 320-by-200 matrix A. Let $A = U \Sigma V^T$ be the SVD of
A. Part 9 of Theorem 3.3 tells us that $A_k = \sum_{i=1}^k \sigma_i u_i v_i^T$ is the best rank-k
approximation of A, in the sense of minimizing $\|A - A_k\|_2 = \sigma_{k+1}$. Note that
it only takes $m \cdot k + n \cdot k = (m + n) \cdot k$ words to store $u_1$ through $u_k$ and
$\sigma_1 v_1$ through $\sigma_k v_k$, from which we can reconstruct $A_k$. In contrast, it takes
$m \cdot n$ words to store A (or $A_k$ explicitly), which is much larger when k is
small. So we will use $A_k$ as our compressed image, stored using $(m + n) \cdot k$
words. The other images in Figure 3.3 show these approximations for various
values of k, along with the relative errors $\sigma_{k+1} / \sigma_1$ and compression ratios
$(m + n) \cdot k / (m \cdot n) = 520 \cdot k / 64000 \approx k / 123$.


    k     Relative error = σ_{k+1}/σ_1     Compression ratio = 520k/64000
    3     .155                             .024
    10    .077                             .081
    20    .040                             .163


These images were produced by the following commands (the clown and 
other images are available in Matlab among the visualization demonstration 
files; check your local installation for location): 


    load clown.mat; [U,S,V]=svd(X); colormap('gray');
    image(U(:,1:k)*S(1:k,1:k)*V(:,1:k)')


There are also many other, cheaper image-compression techniques available
than the SVD [187, 150]. ◇
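The storage count in Example 3.4 is simple arithmetic, and a short sketch makes the compression-ratio column of the table easy to reproduce. The function name `svd_storage_ratio` is a hypothetical helper, not something from the text; it only evaluates $(m + n) \cdot k / (m \cdot n)$ and does not compute an SVD.

```python
def svd_storage_ratio(m, n, k):
    """Words needed for the rank-k factors, (m + n) * k, divided by the
    m * n words needed to store the full m-by-n image."""
    return (m + n) * k / (m * n)


# The 320-by-200 image of Example 3.4, for the ranks shown in the table.
for k in (3, 10, 20):
    print(k, svd_storage_ratio(320, 200, k))
```

The ratio grows linearly in k, so the factored form stops saving storage once k reaches $m \cdot n / (m + n)$, about 123 for this image.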


Later we will see that the cost of solving a least squares problem with the
SVD is about the same as with QR when $m \gg n$, and about $4n^2 m - \frac{4}{3} n^3 + O(n^2)$
for smaller m. A precise comparison of the costs of QR and the SVD
also depends on the machine being used. See section 3.6 for details.


DEFINITION 3.1. Suppose that A is m-by-n with $m \ge n$ and has full rank, with
$A = QR = U \Sigma V^T$ being A's QR decomposition and SVD, respectively. Then

$$A^+ \equiv (A^T A)^{-1} A^T = R^{-1} Q^T = V \Sigma^{-1} U^T$$

is called the (Moore-Penrose) pseudoinverse of A. If m < n, then $A^+ \equiv A^T (A A^T)^{-1}$.

Fig. 3.3. Image compression using the SVD. (a) Original image. (b) Rank k = 3
approximation. (c) Rank k = 10 approximation (relative error 0.07662, compression
0.08141). (d) Rank k = 20 approximation (relative error 0.04031, compression 0.1628).


The pseudoinverse lets us write the solution of the full-rank, overdetermined
least squares problem as simply $x = A^+ b$. If A is square and full rank,
this formula reduces to $x = A^{-1} b$ as expected. The pseudoinverse of A is
computed as pinv(A) in Matlab. When A is not full rank, the Moore-Penrose
pseudoinverse is given by Definition 3.2 in section 3.5.


3.3. Perturbation Theory for the Least Squares Problem 
When A is not square, we define its condition number with respect to the
2-norm to be $\kappa_2(A) \equiv \sigma_{\max}(A) / \sigma_{\min}(A)$. This reduces to the usual condition
number when A is square. The next theorem justifies this definition.


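THEOREM 3.4. Suppose that A is m-by-n with $m \ge n$ and has full rank. Suppose
that x minimizes $\|Ax - b\|_2$. Let $r = b - Ax$ be the residual. Let $\tilde x$ minimize
$\|(A + \delta A)\tilde x - (b + \delta b)\|_2$. Assume $\epsilon \equiv \max\left( \frac{\|\delta A\|_2}{\|A\|_2}, \frac{\|\delta b\|_2}{\|b\|_2} \right) < \frac{1}{\kappa_2(A)} = \frac{\sigma_{\min}(A)}{\sigma_{\max}(A)}$.
Then

$$\frac{\|\tilde x - x\|_2}{\|x\|_2} \le \epsilon \cdot \left\{ \frac{2 \kappa_2(A)}{\cos\theta} + \tan\theta \cdot \kappa_2^2(A) \right\} + O(\epsilon^2) \equiv \epsilon \cdot \kappa_{LS} + O(\epsilon^2),$$

where $\sin\theta = \|r\|_2 / \|b\|_2$. In other words, $\theta$ is the angle between the vectors b and
Ax and measures whether the residual norm $\|r\|_2$ is large (near $\|b\|_2$) or small
(near 0). $\kappa_{LS}$ is the condition number for the least squares problem.


Sketch of Proof. Expand $\tilde x = ((A + \delta A)^T (A + \delta A))^{-1} (A + \delta A)^T (b + \delta b)$ in
powers of $\delta A$ and $\delta b$, and throw away all but the linear terms in $\delta A$ and $\delta b$. □


We have assumed that $\epsilon \cdot \kappa_2(A) < 1$ for the same reason as in the derivation
of bound (2.4) for the perturbed solution of the square linear system Ax = b:
it guarantees that $A + \delta A$ has full rank so that $\tilde x$ is uniquely determined.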

We may interpret this bound as follows. If $\theta$ is 0 or very small, then the
residual is small and the effective condition number is about $2\kappa_2(A)$, much like
ordinary linear equation solving. If $\theta$ is not small but not close to $\pi/2$, the
residual is moderately large, and then the effective condition number can be
much larger: $\kappa_2^2(A)$. If $\theta$ is close to $\pi/2$, so the true solution is nearly zero,
then the effective condition number becomes unbounded even if $\kappa_2(A)$ is small.
These three cases are illustrated below. The rightmost picture makes it easy
to see why the condition number is infinite when $\theta = \pi/2$: in this case the
solution x = 0, and almost any arbitrarily small change in A or b will yield a
nonzero solution x, an "infinitely" large relative change.
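The three regimes can be checked numerically by evaluating the leading-order factor $\kappa_{LS} = 2\kappa_2(A)/\cos\theta + \tan\theta \cdot \kappa_2^2(A)$ from Theorem 3.4. The sketch below is only that formula; the function name `kappa_ls` and the sample values of $\kappa_2(A)$ and $\theta$ are illustrative assumptions.

```python
import math


def kappa_ls(kappa2, theta):
    """Leading-order least squares condition number of Theorem 3.4:
    2*kappa2/cos(theta) + tan(theta)*kappa2**2."""
    return 2.0 * kappa2 / math.cos(theta) + math.tan(theta) * kappa2 ** 2


kappa2 = 1.0e3  # an assumed value of kappa_2(A)
for theta in (0.0, math.pi / 4, math.pi / 2 - 1.0e-6):
    print(f"theta = {theta:.6f}   kappa_LS = {kappa_ls(kappa2, theta):.3e}")
```

At $\theta = 0$ the factor is exactly $2\kappa_2(A)$; at $\theta = \pi/4$ the $\kappa_2^2(A)$ term dominates; as $\theta$ approaches $\pi/2$ both terms grow without bound.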
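[Figure: the three cases, labeled $\kappa_{LS} = 2\kappa(A)$, $\kappa_{LS} = O(\kappa(A)^2)$, and $\kappa_{LS} = \infty$.]

An alternative form for the bound in Theorem 3.4 that eliminates the $O(\epsilon^2)$
term is as follows [256, 147] (here $\tilde r$ is the perturbed residual $\tilde r = (b + \delta b) - (A + \delta A)\tilde x$):

$$\frac{\|\tilde x - x\|_2}{\|x\|_2} \le \frac{\epsilon \kappa_2(A)}{1 - \epsilon \kappa_2(A)} \left( 2 + (\kappa_2(A) + 1) \frac{\|r\|_2}{\|A\|_2 \|x\|_2} \right),$$

$$\frac{\|\tilde r - r\|_2}{\|r\|_2} \le (1 + 2 \epsilon \kappa_2(A)).$$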

We will see that, properly implemented, both the QR decomposition and
SVD are numerically stable; i.e., they yield a solution $\tilde x$ minimizing
$\|(A + \delta A)\tilde x - (b + \delta b)\|_2$ with

$$\max\left( \frac{\|\delta A\|}{\|A\|}, \frac{\|\delta b\|}{\|b\|} \right) = O(\epsilon).$$

We may combine this with the above perturbation bounds to get error bounds 
for the solution of the least squares problem, much as we did for linear equation 
solving. 

The normal equations are not as accurate. Since they involve solving
$(A^T A) x = A^T b$, the accuracy depends on the condition number $\kappa_2(A^T A) = \kappa_2^2(A)$.
Thus the error is always bounded by $\kappa_2^2(A)\epsilon$, never just $\kappa_2(A)\epsilon$.
Therefore we expect that the normal equations can lose twice as many digits
of accuracy as methods based on the QR decomposition and SVD.
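A small experiment shows the squaring of the condition number in action. The 3-by-2 matrix below is a standard illustrative choice and does not come from the text: its columns are $(1, \delta, 0)$ and $(1, 0, \delta)$ with $\delta = 10^{-8}$, so $\kappa_2(A) \approx \sqrt{2}/\delta$ is far below $1/\epsilon$ in double precision, yet $\kappa_2(A^T A) \approx 2/\delta^2$ exceeds it. Forming $A^T A$ in floating point then rounds away $\delta^2$ and the computed Gram matrix is exactly singular.

```python
delta = 1.0e-8

# Columns of A (an assumed example, not from the text).
a1 = (1.0, delta, 0.0)
a2 = (1.0, 0.0, delta)


def dot(u, v):
    """Dot product accumulated in floating point."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Entries of the 2-by-2 Gram matrix A^T A as computed in double precision.
g11, g12, g22 = dot(a1, a1), dot(a1, a2), dot(a2, a2)
det = g11 * g22 - g12 * g12

print(g11, g12, g22)
print("determinant of computed A^T A:", det)
```

In exact arithmetic the determinant is $2\delta^2 + \delta^4 > 0$, so A has full rank; the information that distinguishes the two columns is lost as soon as $A^T A$ is formed.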

Furthermore, solving the normal equations is not necessarily stable; i.e.,
the computed solution $\tilde x$ does not generally minimize $\|(A + \delta A)\tilde x - (b + \delta b)\|_2$
for small $\delta A$ and $\delta b$. Still, when the condition number is small, we expect
the normal equations to be about as accurate as the QR decomposition or 
SVD. Since the normal equations are the fastest way to solve the least squares 
problem, they are the method of choice when the matrix is well-conditioned. 

We return to the problem of solving very ill-conditioned least squares prob- 
lems in section 3.5. 


3.4. Orthogonal Matrices 


As we said in section 3.2.2, Gram-Schmidt orthogonalization (Algorithm 3.1)
may not compute an orthogonal matrix Q when the vectors being orthogonalized
are nearly linearly dependent, so we cannot use it to compute the QR
decomposition stably.


Instead, we base our algorithms on certain easily computable orthogonal 
matrices called Householder reflections and Givens rotations, which we can 
choose to introduce zeros into vectors that they multiply. Later we will show 
that any algorithm that uses these orthogonal matrices to introduce zeros 
is automatically stable. This error analysis will apply to our algorithms for 
the QR decomposition as well as many SVD and eigenvalue algorithms in 
Chapters 4 and 5. 


Despite the possibility of nonorthogonal Q, the MGS algorithm has im- 
portant uses in numerical linear algebra. (There is little use for its less stable 
version, CGS.) These uses include finding eigenvectors of symmetric tridiagonal 
matrices using bisection and inverse iteration (section 5.3.4) and the Arnoldi 
and Lanczos algorithms for reducing a matrix to certain “condensed” forms 
(sections 6.6.1, 6.6.6, and 7.4). Arnoldi and Lanczos are used as the basis of 
algorithms for solving sparse linear systems and finding eigenvalues of sparse 
matrices. MGS can also be modified to solve the least squares problem stably, 
but Q may still be far from orthogonal [33]. 


3.4.1. Householder Transformations 


A Householder transformation (or reflection) is a matrix of the form $P = I - 2uu^T$
where $\|u\|_2 = 1$. It is easy to see that $P = P^T$ and
$P P^T = (I - 2uu^T)(I - 2uu^T) = I - 4uu^T + 4uu^T uu^T = I$, so P is a symmetric, orthogonal
matrix. It is called a reflection because Px is the reflection of x in the plane
through 0 perpendicular to u.

[Figure: the vector x and its reflection Px in the plane perpendicular to u.]

Given a vector x, it is easy to find a Householder reflection $P = I - 2uu^T$
to zero out all but the first entry of x: $Px = [c, 0, \ldots, 0]^T = c \cdot e_1$. We do
this as follows. Write $Px = x - 2u(u^T x) = c \cdot e_1$ so that $u = \frac{1}{2(u^T x)}(x - c e_1)$;
i.e., u is a linear combination of x and $e_1$. Since $\|x\|_2 = \|Px\|_2 = |c|$, u must
be parallel to the vector $\tilde u = x \pm \|x\|_2 e_1$, and so $u = \tilde u / \|\tilde u\|_2$. One can verify
that either choice of sign yields a u satisfying $Px = c e_1$, as long as $\tilde u \ne 0$. We
will use $\tilde u = x + \mathrm{sign}(x_1) \|x\|_2 e_1$, since this means that there is no cancellation in
computing the first component of $\tilde u$. In summary, we get

$$\tilde u = \begin{bmatrix} x_1 + \mathrm{sign}(x_1) \cdot \|x\|_2 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad \text{with} \quad u = \frac{\tilde u}{\|\tilde u\|_2}.$$

We write this as u = House(x). (In practice, we can store $\tilde u$ instead of u to save
the work of computing u, and use the formula $P = I - (2 / \|\tilde u\|_2^2)\, \tilde u \tilde u^T$ instead
of $P = I - 2uu^T$.)
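The construction of House(x) translates almost line by line into code. The following is a minimal sketch in plain Python, assuming vectors are lists of floats; the names `house` and `reflect` are illustrative, and sign(0) is taken to be +1 as in Theorem 3.3.

```python
import math


def house(x):
    """Return the unit vector u with (I - 2 u u^T) x = c * e_1."""
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    if norm_x == 0.0:
        raise ValueError("x must be nonzero")
    sign = 1.0 if x[0] >= 0.0 else -1.0   # sign(0) taken as +1
    u_tilde = list(x)
    u_tilde[0] = x[0] + sign * norm_x     # both terms share a sign: no cancellation
    norm_u = math.sqrt(sum(ui * ui for ui in u_tilde))
    return [ui / norm_u for ui in u_tilde]


def reflect(u, x):
    """Apply P = I - 2 u u^T to x without forming P."""
    gamma = -2.0 * sum(ui * xi for ui, xi in zip(u, x))
    return [xi + gamma * ui for xi, ui in zip(x, u)]


x = [3.0, 4.0, 0.0]
u = house(x)
print(reflect(u, x))   # approximately [-5, 0, 0], up to roundoff
```

With this sign choice $c = -\mathrm{sign}(x_1)\|x\|_2$, so the first entry of Px has the opposite sign to $x_1$.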


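EXAMPLE 3.5. We show how to compute the QR decomposition of a 5-by-4
matrix A using Householder transformations. This example will make the
pattern for general m-by-n matrices evident. In the matrices below, $P_i$ is a
5-by-5 orthogonal matrix, x denotes a generic nonzero entry, and o denotes a
zero entry.

1. Choose $P_1$ so $A_1 \equiv P_1 A = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & x & x & x \\ o & x & x & x \\ o & x & x & x \end{bmatrix}$.

2. Choose $P_2 = \begin{bmatrix} 1 & 0 \\ 0 & P_2' \end{bmatrix}$ so $A_2 \equiv P_2 A_1 = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & x & x \end{bmatrix}$.

3. Choose $P_3 = \begin{bmatrix} 1 & & 0 \\ & 1 & \\ 0 & & P_3' \end{bmatrix}$ so $A_3 \equiv P_3 A_2 = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & o & x \\ o & o & o & x \end{bmatrix}$.

4. Choose $P_4 = \begin{bmatrix} 1 & & & 0 \\ & 1 & & \\ & & 1 & \\ 0 & & & P_4' \end{bmatrix}$ so $A_4 \equiv P_4 A_3 = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & o & x \\ o & o & o & o \end{bmatrix}$.

Here, we have chosen a Householder matrix $P_i'$ to zero out the subdiagonal
entries in column i; this does not disturb the zeros already introduced in
previous columns.

Let us call the final 5-by-4 upper triangular matrix $\tilde R \equiv A_4$. Then
$A = P_1^T P_2^T P_3^T P_4^T \tilde R = QR$, where Q is the first four columns of
$P_1^T P_2^T P_3^T P_4^T = P_1 P_2 P_3 P_4$ (since all $P_i$ are symmetric) and R is the first four rows of $\tilde R$. ◇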
Here is the general algorithm for QR decomposition using Householder
transformations.


ALGORITHM 3.2. QR factorization using Householder reflections:

    for i = 1 to n
        $u_i$ = House(A(i : m, i))
        $P_i' = I - 2 u_i u_i^T$
        A(i : m, i : n) = $P_i'$ A(i : m, i : n)
    end for


Here are some more implementation details. We never need to form $P_i$
explicitly but just multiply

$$(I - 2 u_i u_i^T) A(i : m, i : n) = A(i : m, i : n) - 2 u_i (u_i^T A(i : m, i : n)),$$

which costs less. To store $P_i$, we need only $u_i$, or $\tilde u_i$ and $\|\tilde u_i\|$. These can
be stored in column i of A; in fact it need not be changed! Thus QR can be
"overwritten" on A, where Q is stored in factored form $P_1 \cdots P_{n-1}$, and $P_i$ is
stored as $\tilde u_i$ below the diagonal in column i of A. (We need an extra array of
length n for the top entry of $\tilde u_i$, since the diagonal entry is occupied by $R_{ii}$.)

Recall that to solve the least squares problem $\min \|Ax - b\|_2$ using A = QR,
we need to compute $Q^T b$. This is done as follows: $Q^T b = P_n P_{n-1} \cdots P_1 b$, so
we need only keep multiplying b by $P_1, P_2, \ldots, P_n$:

    for i = 1 to n
        $\gamma = -2 \cdot u_i^T b$
        $b = b + \gamma u_i$
    end for

The cost is n dot products $\gamma = -2 \cdot u_i^T b$ and n "saxpys" $b + \gamma u_i$. The cost
of computing A = QR this way is $2n^2 m - \frac{2}{3} n^3$, and the subsequent cost of
solving the least squares problem given QR is just an additional O(mn).
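Algorithm 3.2 and the loop for $Q^T b$ can be combined into one short least squares solver. The sketch below is an illustration in plain Python, not the LAPACK implementation: matrices are lists of rows, each reflection is applied to A and b as soon as it is computed rather than stored in A, and the names `house` and `householder_lstsq` are assumptions. It assumes A has full column rank, so the diagonal of R is nonzero.

```python
import math


def house(x):
    """Unit vector u with (I - 2 u u^T) x = c * e_1 (zero vector if x = 0)."""
    norm_x = math.sqrt(sum(v * v for v in x))
    if norm_x == 0.0:
        return [0.0] * len(x)             # nothing to zero out; P = I
    sign = 1.0 if x[0] >= 0.0 else -1.0
    u = list(x)
    u[0] += sign * norm_x
    norm_u = math.sqrt(sum(v * v for v in u))
    return [v / norm_u for v in u]


def householder_lstsq(A, b):
    """Solve min ||Ax - b||_2 for a full-rank m-by-n A with m >= n."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]             # work on copies
    c = b[:]
    for i in range(n):
        u = house([R[r][i] for r in range(i, m)])
        # A(i:m, i:n) = A(i:m, i:n) - 2 u (u^T A(i:m, i:n)), column by column
        for j in range(i, n):
            gamma = -2.0 * sum(u[r - i] * R[r][j] for r in range(i, m))
            for r in range(i, m):
                R[r][j] += gamma * u[r - i]
        # the same reflection applied to rows i..m-1 of b
        gamma = -2.0 * sum(u[r - i] * c[r] for r in range(i, m))
        for r in range(i, m):
            c[r] += gamma * u[r - i]
    # back substitution with the leading n-by-n triangle of R
    x = [0.0] * n
    for i in reversed(range(n)):
        s = c[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / R[i][i]
    return x


# Fit b ~ x0 + x1 * t at t = 0, 1, 2, 3; the data lie exactly on 1 + 2 t.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.0, 5.0, 7.0]
print(householder_lstsq(A, b))
```

Each reflection acts only on rows i through m of the working arrays, which is why the dot product and update for b run over that same range.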

The LAPACK routine for solving the least squares problem using QR is 
sgels. Just as Gaussian elimination can be reorganized to use matrix-matrix 
multiplication and other Level 3 BLAS (see section 2.6), the same can be done 
for the QR decomposition; see Question 3.17. In Matlab, if the m-by-n matrix 
A has more rows than columns and b is m by 1, A\b solves the least squares 
problem. The QR decomposition itself is also available via [Q,R]=qr(A). 


3.4.2. Givens Rotations 


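A Givens rotation $R(\theta) \equiv \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$ rotates any vector $x \in \mathbb{R}^2$ counterclockwise
by $\theta$:

[Figure: a vector x and its rotation $R(\theta)x$ through the angle $\theta$.]

We also need to define the Givens rotation by $\theta$ in coordinates i and j:

$$R(i, j, \theta) \equiv \begin{bmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & \cos\theta & & -\sin\theta & & \\ & & & \ddots & & & \\ & & \sin\theta & & \cos\theta & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{bmatrix},$$

where the entries $\cos\theta$, $-\sin\theta$, $\sin\theta$, and $\cos\theta$ lie in rows and columns i and j.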


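Given x, i, and j, we can zero out $x_j$ by choosing $\cos\theta$ and $\sin\theta$ so that

$$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_i \\ x_j \end{bmatrix} = \begin{bmatrix} \sqrt{x_i^2 + x_j^2} \\ 0 \end{bmatrix}$$

or $\cos\theta = \frac{x_i}{\sqrt{x_i^2 + x_j^2}}$ and $\sin\theta = \frac{-x_j}{\sqrt{x_i^2 + x_j^2}}$.

The QR algorithm using Givens rotations is analogous to using Householder
reflections, but when zeroing out column i, we zero it out one entry at a time
(bottom to top, say).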
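The formulas for $\cos\theta$ and $\sin\theta$ can be checked with a few lines of code. This is a minimal sketch; the function name `givens` and the convention of returning the identity rotation when both entries are zero are assumptions made for the illustration.

```python
import math


def givens(xi, xj):
    """Return (c, s) = (cos(theta), sin(theta)) such that
    [[c, -s], [s, c]] applied to (xi, xj) gives (sqrt(xi^2 + xj^2), 0)."""
    r = math.hypot(xi, xj)
    if r == 0.0:
        return 1.0, 0.0        # nothing to rotate: use the identity
    return xi / r, -xj / r


xi, xj = 3.0, 4.0
c, s = givens(xi, xj)
print(c * xi - s * xj)   # first component: close to 5
print(s * xi + c * xj)   # second component: close to 0
```

Using `math.hypot` rather than squaring the entries directly avoids overflow or underflow in $x_i^2 + x_j^2$ for extreme inputs.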


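EXAMPLE 3.6. We illustrate two intermediate steps in computing the QR
decomposition of a 5-by-4 matrix using Givens rotations. To progress from

$$\begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & x & x \end{bmatrix} \quad \text{to} \quad \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & o & x \\ o & o & o & x \end{bmatrix}$$

we multiply

$$\begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & c & -s \\ & & & s & c \end{bmatrix} \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & x & x \end{bmatrix} = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & o & x \end{bmatrix}$$

and

$$\begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & c' & -s' & \\ & & s' & c' & \\ & & & & 1 \end{bmatrix} \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & o & x \end{bmatrix} = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & o & x \\ o & o & o & x \end{bmatrix}.$$

◇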

The cost of the QR decomposition using Givens rotations is twice the cost 
of using Householder reflections. We will need Givens rotations for other ap- 
plications later. 

Here are some implementation details. Just as we overwrote A with Q 
and R when using Householder reflections, we can do the same with Givens 
rotations. We use the same trick, storing the information describing the trans- 
formation in the entries zeroed out. Since a Givens rotation zeros out just one 
entry, we must store the information about the rotation there. We do this as
follows. Let $s = \sin\theta$ and $c = \cos\theta$. If |s| < |c|, store $s \cdot \mathrm{sign}(c)$ and otherwise
store $\mathrm{sign}(s) / c$. To recover s and c from the stored value (call it p) we do
the following: if |p| < 1, then s = p and $c = \sqrt{1 - s^2}$; otherwise c = 1/p and
$s = \sqrt{1 - c^2}$. The reason we do not just store s and compute $c = \sqrt{1 - s^2}$ is
that when s is close to 1, c would be inaccurately reconstructed. Note also
that we may recover either s and c or −s and −c; this is adequate in practice.
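The one-number storage scheme is easy to get wrong by a sign, so a small round-trip sketch may help. The function names `encode` and `decode` are hypothetical. The rule as stated does not cover c = 0 (the division sign(s)/c is undefined there), and this sketch does not handle that case either.

```python
import math


def encode(c, s):
    """Pack a rotation (c, s) with c*c + s*s = 1 into a single number p."""
    if abs(s) < abs(c):
        return s * math.copysign(1.0, c)     # |p| < 1
    return math.copysign(1.0, s) / c         # |p| >= 1; c = 0 is not handled


def decode(p):
    """Recover (c, s) or (-c, -s) from the stored value p."""
    if abs(p) < 1.0:
        s = p
        c = math.sqrt(1.0 - s * s)
    else:
        c = 1.0 / p
        s = math.sqrt(1.0 - c * c)
    return c, s


for theta in (0.3, 2.0, 3.0):
    c, s = math.cos(theta), math.sin(theta)
    print((c, s), decode(encode(c, s)))
```

In each branch the square root is taken of one minus the square of the smaller of |s| and |c|, which is the point of the scheme: the argument of the square root stays at least 1/2, so the larger component is reconstructed accurately.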

There is also a way to apply a sequence of Givens rotations while perform- 
ing fewer floating point operations than described above. These are called Fast 
Givens rotations [7, 8, 33]. Since they are still slower than Householder reflec- 
tions for the purposes of computing the QR factorization, we will not consider 
them further. 


3.4.3. Roundoff Error Analysis for Orthogonal Matrices 


This analysis proves backward stability for the QR decomposition and for many 
of the algorithms for eigenvalues and singular values that we will discuss. 


LEMMA 3.1. Let P be an exact Householder (or Givens) transformation, and
$\tilde P$ be its floating point approximation. Then

$$\mathrm{fl}(\tilde P A) = P(A + E), \qquad \|E\|_2 = O(\epsilon) \cdot \|A\|_2$$

and

$$\mathrm{fl}(A \tilde P) = (A + F) P, \qquad \|F\|_2 = O(\epsilon) \cdot \|A\|_2.$$

Sketch of Proof. Apply the usual formula $\mathrm{fl}(a \odot b) = (a \odot b)(1 + \epsilon)$ to the
formulas for computing and applying $\tilde P$. See Question 3.16. □

In words, this says that applying a single orthogonal matrix is backward
stable.
THEOREM 3.5. Consider applying a sequence of orthogonal transformations
to $A_0$. Then the computed product is an exact orthogonal transformation of
$A_0 + \delta A$, where $\|\delta A\|_2 = O(\epsilon) \|A\|_2$. In other words, the entire computation is
backward stable:

$$\mathrm{fl}(\tilde P_j \tilde P_{j-1} \cdots \tilde P_1 A_0 \tilde Q_1 \tilde Q_2 \cdots \tilde Q_j) = P_j \cdots P_1 (A_0 + E) Q_1 \cdots Q_j$$

with $\|E\|_2 = j \cdot O(\epsilon) \cdot \|A\|_2$. Here, as in Lemma 3.1, $\tilde P_i$ and $\tilde Q_i$ are floating
point orthogonal matrices and $P_i$ and $Q_i$ are exact orthogonal matrices.


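Proof. Let $\bar P_j \equiv P_j \cdots P_1$ and $\bar Q_j \equiv Q_1 \cdots Q_j$. We wish to show that
$\tilde A_j \equiv \mathrm{fl}(\tilde P_j \tilde A_{j-1} \tilde Q_j) = \bar P_j (A + E_j) \bar Q_j$ for some $\|E_j\|_2 = j\,O(\epsilon) \|A\|_2$. We use
Lemma 3.1 recursively. The result is vacuously true for j = 0. Now assume
that the result is true for j − 1. Then we compute

$$\begin{aligned} B &\equiv \mathrm{fl}(\tilde P_j \tilde A_{j-1}) \\ &= P_j (\tilde A_{j-1} + E') && \text{by Lemma 3.1} \\ &= P_j (\bar P_{j-1} (A + E_{j-1}) \bar Q_{j-1} + E') && \text{by induction} \\ &= \bar P_j (A + E_{j-1} + \bar P_{j-1}^T E' \bar Q_{j-1}^T) \bar Q_{j-1} \\ &\equiv \bar P_j (A + E'') \bar Q_{j-1}, \end{aligned}$$

where

$$\|E''\|_2 = \|E_{j-1} + \bar P_{j-1}^T E' \bar Q_{j-1}^T\|_2 \le \|E_{j-1}\|_2 + \|\bar P_{j-1}^T E' \bar Q_{j-1}^T\|_2 = \|E_{j-1}\|_2 + \|E'\|_2 = j\,O(\epsilon) \|A\|_2$$

since $\|E_{j-1}\|_2 = (j - 1) O(\epsilon) \|A\|_2$ and $\|E'\|_2 = O(\epsilon) \|A\|_2$. Postmultiplication
by $\tilde Q_j$ is handled in the same way. □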


3.4.4. Why Orthogonal Matrices? 


Let us consider how the error would grow if we were to multiply by a sequence of
nonorthogonal matrices in Theorem 3.5 instead of orthogonal matrices. Let X
be the exact nonorthogonal transformation and $\tilde X$ be its floating point
approximation. Then the usual floating point error analysis of matrix multiplication
tells us that

$$\mathrm{fl}(\tilde X A) = XA + E = X(A + X^{-1} E) \equiv X(A + F),$$

where $\|E\|_2 \le O(\epsilon) \|X\|_2 \cdot \|A\|_2$ and so $\|F\|_2 \le \|X^{-1}\|_2 \cdot \|E\|_2 \le O(\epsilon) \cdot \kappa_2(X) \cdot \|A\|_2$.

So the error $\|E\|_2$ is magnified by the condition number $\kappa_2(X) \ge 1$. In a
larger product $\tilde X_k \cdots \tilde X_1 A \tilde Y_1 \cdots \tilde Y_k$ the error would be magnified by $\prod_i \kappa_2(X_i) \cdot \kappa_2(Y_i)$.
This factor is minimized if and only if all $X_i$ and $Y_i$ are orthogonal (or
scalar multiples of orthogonal matrices), in which case the factor is one.
3.5. Rank-Deficient Least Squares Problems


So far we have assumed that A has full rank when minimizing $\|Ax - b\|_2$.
What happens when A is rank deficient or “close” to rank deficient? Such 
problems arise in practice in many ways, such as extracting signals from noisy 
data, solution of some integral equations, digital image restoration, comput- 
ing inverse Laplace transforms, and so on [139, 140]. These problems are 
very ill-conditioned, so we will need to impose extra conditions on their so- 
lutions to make them well-conditioned. Making an ill-conditioned problem 
well-conditioned by imposing extra conditions on the solution is called regular- 
ization and is also done in other fields of numerical analysis when ill-conditioned 
problems arise. 

For example, the next proposition shows that if A is exactly rank deficient, 
then the least squares solution is not even unique. 


PROPOSITION 3.1. Let A be m-by-n with $m \ge n$ and rank A = r < n. Then
there is an (n − r)-dimensional set of vectors x that minimize $\|Ax - b\|_2$.


Proof. Let Az = 0. Then if x minimizes $\|Ax - b\|_2$, so does x + z. □

Because of roundoff in the entries of A, or roundoff during the computation, 
it is most often the case that A will have one or more very small computed 
singular values, rather than some exactly zero singular values. The next propo- 
sition shows that in this case, the unique solution is likely to be very large and is 
certainly very sensitive to error in the right-hand side b (see also Theorem 3.4). 


PROPOSITION 3.2. Let $\sigma_{\min} = \sigma_{\min}(A)$, the smallest singular value of A. Assume
$\sigma_{\min} > 0$. Then

1. if x minimizes $\|Ax - b\|_2$, then $\|x\|_2 \ge |u_n^T b| / \sigma_{\min}$, where $u_n$ is the last
column of U in $A = U \Sigma V^T$.

2. changing b to $b + \delta b$ can change x to $x + \delta x$, where $\|\delta x\|_2$ is as large as
$\|\delta b\|_2 / \sigma_{\min}$.

In other words, if A is nearly rank deficient ($\sigma_{\min}$ is small), then the solution
x is ill-conditioned and possibly very large.


Proof. For part 1, $x = A^+ b = V \Sigma^{-1} U^T b$, so $\|x\|_2 = \|\Sigma^{-1} U^T b\|_2 \ge |(\Sigma^{-1} U^T b)_n| = |u_n^T b| / \sigma_{\min}$.
For part 2, choose $\delta b$ parallel to $u_n$. □

We begin our discussion of regularization by showing how to regularize 
an exactly rank-deficient least squares problem: Suppose A is m-by-n with 
rank r < n. Within the (n — r)-dimensional solution space, we will look for 
the unique solution of smallest norm. This solution is characterized by the 
following proposition. 
PROPOSITION 3.3. When A is exactly singular, the x that minimize $\|Ax - b\|_2$
can be characterized as follows. Let $A = U \Sigma V^T$ have rank r < n, and write
the SVD of A as

$$A = [U_1, U_2] \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} [V_1, V_2]^T = U_1 \Sigma_1 V_1^T, \qquad (3.1)$$

where $\Sigma_1$ is r × r and nonsingular and $U_1$ and $V_1$ have r columns. Let
$\sigma = \sigma_{\min}(\Sigma_1)$, the smallest nonzero singular value of A. Then

1. All solutions x can be written $x = V_1 \Sigma_1^{-1} U_1^T b + V_2 z$, z an arbitrary vector.

2. The solution x has minimal norm $\|x\|_2$ precisely when z = 0, in which
case $x = V_1 \Sigma_1^{-1} U_1^T b$ and $\|x\|_2 \le \|b\|_2 / \sigma$.

3. Changing b to $b + \delta b$ can change the minimal norm solution x by at most
$\|\delta b\|_2 / \sigma$.

In other words, the norm and condition number of the unique minimal norm
solution x depend on the smallest nonzero singular value of A.


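Proof. Choose $\tilde U$ so $[U, \tilde U] = [U_1, U_2, \tilde U]$ is an m × m orthogonal matrix. Then

$$\begin{aligned} \|Ax - b\|_2^2 &= \|[U, \tilde U]^T (Ax - b)\|_2^2 \\ &= \left\| \begin{bmatrix} U_1^T \\ U_2^T \\ \tilde U^T \end{bmatrix} (U_1 \Sigma_1 V_1^T x - b) \right\|_2^2 \\ &= \left\| \begin{bmatrix} \Sigma_1 V_1^T x - U_1^T b \\ -U_2^T b \\ -\tilde U^T b \end{bmatrix} \right\|_2^2 \\ &= \|\Sigma_1 V_1^T x - U_1^T b\|_2^2 + \|U_2^T b\|_2^2 + \|\tilde U^T b\|_2^2. \end{aligned}$$

1. $\|Ax - b\|_2$ is minimized when $\Sigma_1 V_1^T x = U_1^T b$, or $x = V_1 \Sigma_1^{-1} U_1^T b + V_2 z$
since $V_1^T V_2 z = 0$ for all z.

2. Since the columns of $V_1$ and $V_2$ are mutually orthogonal, the Pythagorean
theorem implies that $\|x\|_2^2 = \|V_1 \Sigma_1^{-1} U_1^T b\|_2^2 + \|V_2 z\|_2^2$, and this is minimized
by z = 0.

3. Changing b by $\delta b$ changes x by at most $\|V_1 \Sigma_1^{-1} U_1^T \delta b\|_2 \le \|\Sigma_1^{-1}\|_2 \|\delta b\|_2 = \|\delta b\|_2 / \sigma$. □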


Proposition 3.3 tells us that the minimum norm solution x is unique and 
may be well-conditioned if the smallest nonzero singular value is not too small. 
This is key to a practical algorithm, discussed in the next section. 
EXAMPLE 3.7. Suppose that we are doing medical research on the effect of a
certain drug on blood sugar level. We collect data from each patient (numbered
from i = 1 to m) by recording his or her initial blood sugar level ($a_{i,1}$)
and final blood sugar level ($b_i$), the amount of drug administered ($a_{i,2}$), and
other medical quantities, including body weights on each day of a week-long
treatment ($a_{i,3}$ through $a_{i,9}$). In total, there are n < m medical quantities
measured for each patient. Our goal is to predict $b_i$ given $a_{i,1}$ through $a_{i,n}$,
and we formulate this as the least squares problem $\min_x \|Ax - b\|_2$. We plan to
use x to predict the final blood sugar level $b_j$ of future patient j by computing
the dot product $\sum_{k=1}^n a_{jk} x_k$.

Since people's weight generally does not change significantly from day to
day, it is likely that columns 3 through 9 of matrix A, which contain the
weights, are very similar. For the sake of argument, suppose that columns
3 and 4 are identical (which may be the case if the weights are rounded to
the nearest pound). This means that matrix A is rank deficient and that
$x_0 = [0, 0, 1, -1, 0, \ldots, 0]^T$ is a right null vector of A. So if x is a (minimum
norm) solution of the least squares problem $\min_x \|Ax - b\|_2$, then $x + \beta x_0$ is
also a (nonminimum norm) solution for any scalar $\beta$, including, say, $\beta = 0$ and
$\beta = 10^6$. Is there any reason to prefer one value of $\beta$ over another? The value
$10^6$ is clearly not a good one, since future patient j, who gains one pound
between days 1 and 2, will have that difference of one pound multiplied by
$10^6$ in the predictor $\sum_{k=1}^n a_{jk} x_k$ of final blood sugar level. It is much more
reasonable to choose $\beta = 0$, corresponding to the minimum norm solution x.
◇
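The effect described in Example 3.7 can be reproduced on a tiny made-up data set. Everything in the sketch below is hypothetical: a 3-by-3 matrix whose second and third columns (two daily weights) are identical, a right-hand side that is fit exactly, and a scalar $\beta = 10^6$. It shows that $x$ and $x + \beta x_0$ have the same residual but give wildly different predictions for a new patient whose two weights differ by one pound.

```python
def predict(a, x):
    """Dot product of one patient's measurements a with the coefficients x."""
    return sum(ai * xi for ai, xi in zip(a, x))


def residual_norm_sq(A, x, b):
    """Squared 2-norm of Ax - b."""
    return sum((predict(row, x) - bi) ** 2 for row, bi in zip(A, b))


# Hypothetical data: a constant column and two identical weight columns.
A = [[1.0, 150.0, 150.0],
     [1.0, 160.0, 160.0],
     [1.0, 170.0, 170.0]]
b = [151.0, 161.0, 171.0]

x_min = [1.0, 0.5, 0.5]            # minimum norm solution (orthogonal to x0)
x0 = [0.0, 1.0, -1.0]              # right null vector of A
beta = 1.0e6
x_big = [xi + beta * zi for xi, zi in zip(x_min, x0)]

print(residual_norm_sq(A, x_min, b), residual_norm_sq(A, x_big, b))

new_patient = [1.0, 150.0, 151.0]  # gains one pound between the two days
print(predict(new_patient, x_min), predict(new_patient, x_big))
```

Both solutions reproduce the training data, so the residual cannot distinguish them; only the size of the coefficients does, which is the argument for choosing the minimum norm solution.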


For further justification of using the minimum norm solution for rank- 
deficient problems, see [139, 140]. 

When A is square and nonsingular, the unique solution of Ax = b is of
course $x = A^{-1} b$. If A has more rows than columns and is possibly rank-deficient,
the unique minimum-norm least squares solution may be similarly
written $x = A^+ b$, where the Moore-Penrose pseudoinverse $A^+$ is defined as
follows:


DEFINITION 3.2. (Moore-Penrose pseudoinverse $A^+$ for possibly rank-deficient A)
Let $A = U \Sigma V^T = U_1 \Sigma_1 V_1^T$ as in equation (3.1). Then $A^+ \equiv V_1 \Sigma_1^{-1} U_1^T$.
This is also written $A^+ = V \Sigma^+ U^T$, where $\Sigma^+ = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix}^+ = \begin{bmatrix} \Sigma_1^{-1} & 0 \\ 0 & 0 \end{bmatrix}$.


So the solution of the least squares problem is always $x = A^+ b$, and when
A is rank deficient, x has minimum norm.


3.5.1. Solving Rank-Deficient Least Squares Problems Using the 
SVD 


Our goal is to compute the minimum norm solution x, despite roundoff. In 
the last section, we saw that the minimal norm solution was unique and had a 
condition number depending on the smallest nonzero singular value. Therefore,
computing the minimum norm solution requires knowing the smallest nonzero 
singular value and hence also the rank of A. The main difficulty is that the 
rank of a matrix changes discontinuously as a function of the matrix. 


For example, the 2-by-2 matrix A = diag(1, 0) is exactly singular, and its
smallest nonzero singular value is $\sigma = 1$. As described in Proposition 3.3, the
minimum norm least squares solution to $\min_x \|Ax - b\|_2$ with $b = [1, 1]^T$ is
$x = [1, 0]^T$, with condition number $1/\sigma = 1$. But if we make an arbitrarily
tiny perturbation to get $\hat A = \mathrm{diag}(1, \epsilon)$, then $\sigma$ drops to $\epsilon$ and $x = [1, 1/\epsilon]^T$
becomes enormous, as does its condition number $1/\epsilon$. In general, roundoff will
make such tiny perturbations, of magnitude $O(\epsilon) \|A\|_2$. As we just saw, this
can increase the condition number from $1/\sigma$ to $1/\epsilon$.


We deal with this discontinuity algorithmically as follows. In general each
computed singular value $\hat\sigma_i$ satisfies $|\hat\sigma_i - \sigma_i| \le O(\epsilon) \|A\|_2$. This is a consequence
of backward stability: the computed SVD will be the exact SVD of a slightly
different matrix: $\hat A = \hat U \hat\Sigma \hat V^T = A + \delta A$, with $\|\delta A\| = O(\epsilon) \cdot \|A\|$. (This is
discussed in detail in Chapter 5.) This means that any $\hat\sigma_i \le O(\epsilon) \|A\|_2$ can
be treated as zero, because roundoff makes it indistinguishable from 0. In the
above 2-by-2 example, this means we would set the $\epsilon$ in $\hat A$ to zero before solving
the least squares problem. This would raise the smallest nonzero singular value
from $\epsilon$ to 1 and correspondingly decrease the condition number from $1/\epsilon$ to
$1/\sigma = 1$.


More generally, let tol be a user-supplied measure of uncertainty in the data
A. Roundoff implies that $tol \ge \epsilon \cdot \|A\|$, but it may be larger, depending on the
source of the data in A. Now set $\tilde\sigma_i = \hat\sigma_i$ if $\hat\sigma_i > tol$, and $\tilde\sigma_i = 0$ otherwise. Let
$\tilde\Sigma = \mathrm{diag}(\tilde\sigma_i)$. We call $\hat U \tilde\Sigma \hat V^T$ the truncated SVD of A, because we have set
singular values smaller than tol to zero. Now we solve the least squares problem
using the truncated SVD instead of the original SVD. This is justified since
$\|\hat U \tilde\Sigma \hat V^T - \hat U \hat\Sigma \hat V^T\|_2 = \|\hat U (\tilde\Sigma - \hat\Sigma) \hat V^T\|_2 < tol$, i.e., the change in A caused by
changing each $\hat\sigma_i$ to $\tilde\sigma_i$ is less than the user's inherent uncertainty in the data.
The motivation for using $\tilde\Sigma$ instead of $\hat\Sigma$ is that of all matrices within distance
tol of $\hat\Sigma$, $\tilde\Sigma$ maximizes the smallest nonzero singular value $\sigma$. In other words, it
minimizes both the norm of the minimum norm least squares solution x and its
condition number. The picture below illustrates the geometric relationships
among the input matrix A, $\hat A = \hat U \hat\Sigma \hat V^T$, and $\tilde A = \hat U \tilde\Sigma \hat V^T$, where we think
of each matrix as a point in Euclidean space $\mathbb{R}^{m \cdot n}$. In this space, the rank-deficient
matrices form a surface, as shown below:
[Figure: the surface of rank-deficient matrices, with the matrices A, $\hat A$, and $\tilde A$ shown as points.]
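The truncation rule can be tried directly on the 2-by-2 example $\hat A = \mathrm{diag}(1, \epsilon)$. For a diagonal matrix with nonnegative entries the SVD is trivial ($U = V = I$ and the singular values are the diagonal entries), so the sketch below needs no SVD routine. The function name `truncated_diag_solve` and the values of $\epsilon$ and tol are assumptions chosen for illustration.

```python
def truncated_diag_solve(sigma_hat, b, tol):
    """Minimum norm solution of min ||diag(sigma_hat) x - b||_2 after
    replacing every sigma_hat[i] <= tol by zero (U = V = I here)."""
    x = []
    for s, bi in zip(sigma_hat, b):
        x.append(bi / s if s > tol else 0.0)
    return x


eps = 1.0e-12
b = [1.0, 1.0]
print(truncated_diag_solve([1.0, eps], b, tol=0.0))      # no truncation: [1, 1/eps]
print(truncated_diag_solve([1.0, eps], b, tol=1.0e-9))   # truncated: [1, 0]
```

With truncation the computed solution matches the minimum norm solution $[1, 0]^T$ of the unperturbed problem, and the effective condition number returns to $1/\sigma = 1$.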


EXAMPLE 3.8. We illustrate the above procedure on two 20-by-10 rank-deficient
matrices $A_1$ (of rank $r_1 = 5$) and $A_2$ (of rank $r_2 = 7$). We write the SVDs of
either $A_1$ or $A_2$ as $A_i = U_i \Sigma_i V_i^T$, where the common dimension of $U_i$, $\Sigma_i$, and
$V_i$ is the rank $r_i$ of $A_i$; this is the same notation as in Proposition 3.3. The $r_i$
nonzero singular values of $A_i$ (singular values of $\Sigma_i$) are shown as red pluses
in Figure 3.4 (for $A_1$) and Figure 3.5 (for $A_2$). Note that $A_1$ in Figure 3.4 has
five large nonzero singular values (all slightly exceeding 1 and so plotted on
top of one another, on the right edge of the graph), whereas the seven nonzero
singular values of $A_2$ in Figure 3.5 range down to $1.2 \cdot 10^{-9} \approx tol$.

We then choose an $r_i$-dimensional vector $x_i'$, and let $x_i = V_i x_i'$ and
$b_i = A_i x_i = U_i \Sigma_i x_i'$, so $x_i$ is the exact minimum norm solution minimizing $\|A_i x_i - b_i\|_2$.
Then we consider a sequence of perturbed problems $A_i + \delta A$, where the
perturbation $\delta A$ is chosen randomly to have a range of norms, and solve the
least squares problems $\|(A_i + \delta A) y_i - b_i\|_2$ using the truncated least squares
procedure with $tol = 10^{-9}$. The blue lines in Figures 3.4 and 3.5 plot the
computed rank of $A_i + \delta A$ (number of computed singular values exceeding
$tol = 10^{-9}$) versus $\|\delta A\|_2$ (in the top graphs), and the error $\|y_i - x_i\|_2 / \|x_i\|_2$
(in the bottom graphs). The Matlab code for producing these figures is in
HOMEPAGE/Matlab/RankDeficient.m.

The simplest case is in Figure 3.4, so we consider it first. $A_1 + \delta A$ will
have five singular values near or slightly exceeding 1 and the other five equal
to $\|\delta A\|_2$ or less. For $\|\delta A\|_2 < \mathrm{tol}$, the computed rank of $A_1 + \delta A$ stays the
same as that of $A_1$, namely, 5. The error also increases slowly from near
machine epsilon ($\approx 10^{-16}$) to about $10^{-10}$ near $\|\delta A\|_2 = \mathrm{tol}$, and then both
the rank and the error jump, to 10 and 1, respectively, for larger $\|\delta A\|_2$. This
is consistent with our analysis in Proposition 3.3, which says that the condition
number is the reciprocal of the smallest nonzero singular value, i.e., the smallest
singular value exceeding tol. For $\|\delta A\|_2 < \mathrm{tol}$, this smallest nonzero singular
value is near to, or slightly exceeds, 1. Therefore Proposition 3.3 predicts an
error of $\|\delta A\|_2/O(1) = O(\|\delta A\|_2)$. This well-conditioned situation is confirmed
by the small error plotted to the left of $\|\delta A\|_2 = \mathrm{tol}$ in the bottom graph of
Figure 3.4. On the other hand, when $\|\delta A\|_2 > \mathrm{tol}$, then the smallest nonzero
singular value is $O(\|\delta A\|_2)$, which is quite small, causing the error to jump to
$\|\delta A\|_2/O(\|\delta A\|_2) = O(1)$, as shown to the right of $\|\delta A\|_2 = \mathrm{tol}$ in the bottom
graph of Figure 3.4.

Fig. 3.4. Graph of truncated least squares solution of $\min_{y_1} \|(A_1 + \delta A)y_1 - b_1\|_2$, using
$\mathrm{tol} = 10^{-9}$. The singular values of $A_1$ are shown as red pluses. The norm $\|\delta A\|_2$ is
the horizontal axis. The top graph plots the rank of $A_1 + \delta A$, i.e., the number of
singular values exceeding tol. The bottom graph plots $\|y_1 - x_1\|_2/\|x_1\|_2$, where $x_1$ is
the solution with $\delta A = 0$.

In Figure 3.5, the nonzero singular values of $A_2$ are also shown as red pluses;
the smallest one, $1.2 \cdot 10^{-9}$, is just larger than tol. So the predicted error when
$\|\delta A\|_2 < \mathrm{tol}$ is $\|\delta A\|_2/10^{-9}$, which grows to $O(1)$ when $\|\delta A\|_2 = \mathrm{tol}$. This is
confirmed by the bottom graph in Figure 3.5. $\diamond$
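The truncated SVD procedure of this section can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the code in RankDeficient.m: the function name `truncated_svd_lstsq`, the rank-5 test matrix, the random seed, and the two perturbation sizes are all our own choices. The experiment mimics the situation of Figure 3.4, with one perturbation smaller than tol and one larger.

```python
import numpy as np

def truncated_svd_lstsq(A, b, tol):
    """Minimum norm least squares solution using only singular values > tol."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tol                      # singular values treated as nonzero
    rank = int(np.count_nonzero(keep))
    # x = V_k diag(1/s_k) U_k^T b, restricted to the kept singular triplets
    coeffs = (U[:, keep].T @ b) / s[keep]
    x = Vt[keep, :].T @ coeffs
    return x, rank

rng = np.random.default_rng(0)          # fixed seed, for repeatability only
m, n, r = 20, 10, 5
# Rank-5 test matrix with singular values slightly exceeding 1 (assumed values).
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
sigma = 1.0 + 0.1 * rng.random(r)
A = (U * sigma) @ V.T
x_true = V @ rng.standard_normal(r)     # lies in the row space: minimum norm
b = A @ x_true

tol = 1e-9
results = []
for size in (1e-12, 1e-6):              # one perturbation below tol, one above
    dA = rng.standard_normal((m, n))
    dA *= size / np.linalg.norm(dA, 2)  # scale so that ||dA||_2 = size
    y, rank = truncated_svd_lstsq(A + dA, b, tol)
    err = np.linalg.norm(y - x_true) / np.linalg.norm(x_true)
    results.append((size, rank, err))
    print(f"||dA||_2 = {size:.0e}: rank = {rank}, relative error = {err:.2e}")
```

For the perturbation below tol the computed rank should stay at 5 and the error should stay small; for the one above tol the rank should jump to 10 and the error should become large, in line with the discussion of Figure 3.4.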


3.5.2. Solving Rank-Deficient Least Squares Problems Using QR 
with Pivoting 


A cheaper but sometimes less accurate alternative to the SVD is QR with
pivoting. In exact arithmetic, if $A$ had rank $r < n$ and its first $r$ columns were
independent, then its QR decomposition would look like

$$A = QR = Q \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \\ 0 & 0 \end{bmatrix},$$

where $R_{11}$ is $r$-by-$r$ and nonsingular and $R_{12}$ is $r$-by-$(n - r)$. With roundoff,
we might hope to compute

$$R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \\ 0 & 0 \end{bmatrix}$$

with $\|R_{22}\|_2$ very small, on the order of $\epsilon \|A\|_2$. In this case we could just set
$R_{22} = 0$ and minimize $\|Ax - b\|_2$ as follows: let $[Q, \tilde Q]$ be square and orthogonal
so that

$$\|Ax - b\|_2^2 = \left\| \begin{bmatrix} Q^T \\ \tilde Q^T \end{bmatrix} (Ax - b) \right\|_2^2 = \left\| \begin{bmatrix} Rx - Q^T b \\ -\tilde Q^T b \end{bmatrix} \right\|_2^2 = \|Rx - Q^T b\|_2^2 + \|\tilde Q^T b\|_2^2.$$

Write $Q = [Q_1, Q_2]$ and $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ conformally with $R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix}$ so
that

$$\|Ax - b\|_2^2 = \|R_{11} x_1 + R_{12} x_2 - Q_1^T b\|_2^2 + \|Q_2^T b\|_2^2 + \|\tilde Q^T b\|_2^2$$

is minimized by choosing $x = \begin{bmatrix} R_{11}^{-1}(Q_1^T b - R_{12} x_2) \\ x_2 \end{bmatrix}$ for any $x_2$. Note that the
choice $x_2 = 0$ does not necessarily minimize $\|x\|_2$, but it is a reasonable choice,
especially if $R_{11}$ is well-conditioned and $R_{11}^{-1} R_{12}$ is small.

Fig. 3.5. Graph of truncated least squares solution of $\min_{y_2} \|(A_2 + \delta A)y_2 - b_2\|_2$, using
$\mathrm{tol} = 10^{-9}$. The singular values of $A_2$ are shown as red pluses. The norm $\|\delta A\|_2$ is
the horizontal axis. The top graph plots the rank of $A_2 + \delta A$, i.e., the number of
singular values exceeding tol. The bottom graph plots $\|y_2 - x_2\|_2/\|x_2\|_2$, where $x_2$ is
the solution with $\delta A = 0$.

Unfortunately, this method is not reliable since $R$ may be nearly rank
deficient even if no $R_{22}$ is small. For example, the $n$-by-$n$ bidiagonal matrix

$$A = \begin{bmatrix} .5 & 1 & & \\ & .5 & \ddots & \\ & & \ddots & 1 \\ & & & .5 \end{bmatrix}$$

has $\sigma_{\min}(A) \approx 2^{-n}$, but $A = Q \cdot R$ with $Q = I$ and $R = A$, and no $R_{22}$ is small.

To deal with this failure to recognize rank deficiency, we may do QR with
column pivoting. This means that we factorize $AP = QR$, $P$ being a permutation
matrix. The idea is that at step $i$ (which ranges from 1 to $n$, the number
of columns) we select from the unfinished part of $A$ (columns $i$ to $n$ and rows
$i$ to $m$) the column of largest norm and exchange it with the $i$th column. We
then proceed to compute the usual Householder transformation to zero out
column $i$ in entries $i+1$ to $m$. This pivoting strategy attempts to keep $R_{11}$ as
well-conditioned as possible and $R_{22}$ as small as possible.
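The pivoting strategy just described can be sketched as follows in Python with NumPy. This is a minimal, unblocked illustration, not LAPACK's implementation: the function name `qr_column_pivoting`, the tie-breaking rule (first column of largest norm), and the recomputation of all column norms at every step are our own simplifying assumptions. The sketch is applied to the bidiagonal matrix above with $n = 11$.

```python
import numpy as np

def qr_column_pivoting(A):
    """Householder QR with column pivoting: A[:, perm] = Q @ R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    perm = np.arange(n)
    for i in range(min(m, n)):
        # Select the remaining column whose part in rows i..m-1 has largest norm.
        norms = np.linalg.norm(R[i:, i:], axis=0)
        j = i + int(np.argmax(norms))
        R[:, [i, j]] = R[:, [j, i]]
        perm[[i, j]] = perm[[j, i]]
        # Unit vector u with (I - 2 u u^T) x = -sign(x_0) ||x||_2 e_1.
        x = R[i:, i].copy()
        normx = np.linalg.norm(x)
        if normx == 0.0:
            break                        # the unfinished part is exactly zero
        u = x
        u[0] += np.copysign(normx, u[0])
        u /= np.linalg.norm(u)
        # Apply the reflection to the unfinished part of R and accumulate Q.
        R[i:, i:] -= 2.0 * np.outer(u, u @ R[i:, i:])
        Q[:, i:] -= 2.0 * np.outer(Q[:, i:] @ u, u)
    return Q, np.triu(R), perm

n = 11
A = 0.5 * np.eye(n) + np.diag(np.ones(n - 1), k=1)   # the bidiagonal example
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
Q, R, perm = qr_column_pivoting(A)
diag_R = np.abs(np.diag(R))

print("without pivoting, smallest |R_ii| =", np.abs(np.diag(A)).min())
print(f"sigma_min(A) = {sigma_min:.3e}")
print(f"with pivoting, |R_nn| = {diag_R[-1]:.3e}")
```

Without pivoting every diagonal entry of $R = A$ equals .5, which gives no hint that $A$ is nearly rank deficient; with pivoting the diagonal of $R$ is nonincreasing in magnitude, and its last entry can be compared directly with $\sigma_{\min}(A)$.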


EXAMPLE 3.9. If we compute the QR decomposition with column pivoting
of the last example (.5 on the diagonal and 1 on the superdiagonal) with
$n = 11$, we get $R_{11,11} = 4.23 \cdot 10^{-4}$, a reasonable approximation to $\sigma_{\min}(A) =
3.66 \cdot 10^{-4}$. Note that $R_{nn} \ge \sigma_{\min}(A)$ since $\sigma_{\min}(A)$ is the norm of the smallest
perturbation that can lower the rank, and setting $R_{nn}$ to 0 lowers the rank. $\diamond$


One can only show $\frac{R_{nn}}{\sigma_{\min}(A)} \lesssim 2^n$, but usually $R_{nn}$ is a reasonable
approximation to $\sigma_{\min}(A)$. The worst case, however, is as bad as worst-case pivot
growth in GEPP.

More sophisticated pivoting schemes than QR with column pivoting, called
rank-revealing QR algorithms, have been a subject of much recent study.
Rank-revealing QR algorithms that detect rank more reliably and sometimes also
faster than QR with column pivoting have been developed [28, 30, 47, 49, 107,
124, 126, 148, 194, 234]. We discuss them further in the next section.

QR decomposition with column pivoting is available as subroutine sgeqpf in LAPACK.
LAPACK also has several similar factorizations available: RQ (sgerqf), LQ
(sgelqf), and QL (sgeqlf). Future LAPACK releases will contain improved
versions of QR.


3.6. Performance Comparison of Methods for Solving 
Least Squares Problems 


What is the fastest algorithm for solving dense least squares problems? As 
discussed in section 3.2, solving the normal equations is fastest, followed by 
QR and the SVD. If A is quite well-conditioned, then the normal equations are 
about as accurate as the other methods, even though they are not numerically 
stable, and may be used as well. When A is not well-conditioned but far from 
rank deficient, we should use QR. 

Since the design of fast algorithms for rank-deficient least squares problems
is a current research area, it is difficult to recommend a single algorithm to use.
We summarize a recent study [204] that compared the performance of several
algorithms with that of the fastest stable algorithm for the non-rank-deficient
case: QR without pivoting, implemented using Householder transformations
as described in section 3.4.1, with memory hierarchy optimizations
described in Question 3.17. These comparisons were made in double precision
arithmetic on an IBM RS6000/590. Included in the comparison were the
rank-revealing QR algorithms mentioned in section 3.5.2 and various
implementations of the SVD (see section 5.4). Matrices of various sizes and with
various singular value distributions were tested. We present results for two
singular value distributions:


Type 1: random matrices, where each entry is uniformly distributed from —1 
to 1; 


Type 2: matrices with singular values distributed geometrically from 1 to $\epsilon$
(in other words, the $i$th singular value is $\gamma^i$, where $\gamma$ is chosen so that
$\gamma^n = \epsilon$).

Type 1 matrices are generally well-conditioned, and Type 2 matrices are
rank-deficient. We tested small square matrices ($n = m = 20$) and large
square matrices ($m = n = 1600$). We tested square matrices because if $m$
is sufficiently greater than $n$ in the $m$-by-$n$ matrix $A$, it is cheaper to do a
QR decomposition as a "preprocessing step" and then perform rank-revealing
QR or the SVD on $R$. (This is done in LAPACK.) If $m \gg n$ then the initial
QR decomposition dominates the cost of the subsequent operations on the
$n$-by-$n$ matrix $R$, and all the algorithms cost about the same.

The fastest version of rank-revealing QR was that of [30, 194]. On Type 
1 matrices, this algorithm ranged from 3.2 times slower than QR without 
pivoting for n = m = 20 to just 1.1 times slower for n = m = 1600. On Type 2 
matrices, it ranged from 2.3 times slower (for n = m = 20) to 1.2 times slower 
(for n = m = 1600). In contrast, the current LAPACK algorithm, dgeqpf, 
was 2 times to 2.5 times slower for both matrix types. 

The fastest version of the SVD was the one in [57], although one based on 
divide-and-conquer (see section 5.3.3) was about equally fast for n = m = 1600. 
(The one based on divide-and-conquer also used much less memory.) For Type 
1 matrices, the SVD algorithm was 7.8 times slower (for n = m = 20) to 3.3 
times slower (for n = m = 1600). For Type 2 matrices, the SVD algorithm was 
3.5 times slower (for n = m = 20) to 3.0 times slower (for n = m = 1600). In 
contrast, the current LAPACK algorithm, dgelss, ranged from 4 times slower 
(for Type 2 matrices with n = m = 20) to 97 times slower (for Type 1 matrices 
with n = m = 1600). This enormous slowdown is apparently due to memory 
hierarchy effects. 

Thus, we see that there is a tradeoff between reliability and speed in solv- 
ing rank-deficient least squares problems: QR without pivoting is fastest but 
least reliable, the SVD is slowest but most reliable, and rank-revealing QR 
is in between. If $m \gg n$, all algorithms cost about the same. The choice of
algorithm depends on the relative importance of speed and reliability to the 
user. 

Future LAPACK releases will contain improved versions of both rank-
revealing QR and SVD algorithms for the least squares problem.


3.7. References and Other Topics for Chapter 3


The best recent reference on least squares problems is [33], which also discusses 
variations on the basic problem discussed here (such as constrained, weighted, 
and updating least squares), different ways to regularize rank-deficient prob- 
lems, and software for sparse least squares problems. See also chapter 5 of 
[119] and [166]. Perturbation theory and error bounds for the least squares 
solution are discussed in detail in [147]. Rank-revealing QR decompositions 
are discussed in [28, 30, 47, 49, 124, 148, 194, 204, 234]. In particular, these 
papers examine the tradeoff between cost and accuracy in rank determination, 
and in [204] there is a comprehensive performance comparison of the available 
methods for rank-deficient least squares problems. 


3.8. Questions for Chapter 3 


QUESTION 3.1. (Easy) Show that the two variations of Algorithm 3.1, CGS
and MGS, are mathematically equivalent by showing that the two formulas for
$r_{ji}$ yield the same results in exact arithmetic.


QUESTION 3.2. (Easy) This question will illustrate the difference in
numerical stability among three algorithms for computing the QR factorization
of a matrix: Householder QR (Algorithm 3.2), CGS (Algorithm 3.1),
and MGS (Algorithm 3.1). Obtain the Matlab program QRStability.m from
HOMEPAGE/Matlab/QRStability.m. This program generates random matrices
with user-specified dimensions m and n and condition number cnd, computes
their QR decomposition using the three algorithms, and measures the accuracy
of the results. It does this with the residual $\|A - Q \cdot R\|/\|A\|$, which should be
around machine epsilon $\epsilon$ for a stable algorithm, and the orthogonality of $Q$,
$\|Q^T \cdot Q - I\|$, which should also be around $\epsilon$. Run this program for small
matrix dimensions (such as m = 6 and n = 4), modest numbers of random matrices
(samples = 20), and condition numbers ranging from cnd = 1 up to cnd = $10^{15}$.
Describe what you see. Which algorithms are more stable than others? See
if you can describe how large $\|Q^T \cdot Q - I\|$ can be as a function of choice of
algorithm, cnd, and $\epsilon$.


QUESTION 3.3. (Medium; Hard) Let $A$ be $m$-by-$n$, $m \ge n$, and have full rank.


1. (Medium) Show that $\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \cdot \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}$ has a solution where $x$
minimizes $\|Ax - b\|_2$. One reason for this formulation is that we can
apply iterative refinement to this linear system if we want a more accurate
answer (see section 2.5).


2. (Medium) What is the condition number of the coefficient matrix, in
terms of the singular values of $A$? Hint: Use the SVD of $A$.


3. (Medium) Give an explicit expression for the inverse of the coefficient
matrix, as a block 2-by-2 matrix. Hint: Use 2-by-2 block Gaussian
elimination. Where have we previously seen the (2,1) block entry?


4. (Hard) Show how to use the QR decomposition of A to implement an 
iterative refinement algorithm to improve the accuracy of x. 


QUESTION 3.4. (Medium) Weighted least squares: If some components of $Ax -
b$ are more important than others, we can weight them with a scale factor $d_i$
and solve the weighted least squares problem $\min \|D(Ax - b)\|_2$ instead, where
$D$ has diagonal entries $d_i$. More generally, recall that if $C$ is symmetric positive
definite, then $\|x\|_C \equiv (x^T C x)^{1/2}$ is a norm, and we can consider minimizing
$\|Ax - b\|_C$. Derive the normal equations for this problem, as well as the
formulation corresponding to the previous question.


QUESTION 3.5. (Medium; Z. Bai) Let $A \in \mathbb{R}^{n \times n}$ be positive definite. Two
vectors $u_1$ and $u_2$ are called $A$-orthogonal if $u_1^T A u_2 = 0$. If $U \in \mathbb{R}^{n \times r}$ and
$U^T A U = I$, then the columns of $U$ are said to be $A$-orthonormal. Show that
every subspace has an $A$-orthonormal basis.


QUESTION 3.6. (Easy; Z. Bai) Let $A$ have the form

$$A = \begin{bmatrix} R \\ S \end{bmatrix},$$

where $R$ is $n$-by-$n$ and upper triangular, and $S$ is $m$-by-$n$ and dense. Describe
an algorithm using Householder transformations for reducing $A$ to upper
triangular form. Your algorithm should not "fill in" the zeros in $R$ and thus require
fewer operations than would Algorithm 3.2 applied to $A$.


QUESTION 3.7. (Medium; Z. Bai) If $A = R + uv^T$, where $R$ is an upper
triangular matrix, and $u$ and $v$ are column vectors, describe an efficient algorithm
to compute the QR decomposition of $A$. Hint: Using Givens rotations, your
algorithm should take $O(n^2)$ operations. In contrast, Algorithm 3.2 would take
$O(n^3)$ operations.


QUESTION 3.8. (Medium; Z. Bai) Let $x \in \mathbb{R}^n$ and let $P$ be a Householder
matrix such that $Px = \|x\|_2 e_1$. Let $G_{1,2}, \ldots, G_{n-1,n}$ be Givens rotations,
and let $Q = G_{1,2} \cdots G_{n-1,n}$. Suppose $Qx = \pm\|x\|_2 e_1$. Must $P$ equal $Q$? (You
need to give a proof or a counterexample.)


QUESTION 3.9. (Easy; Z. Bai) Let $A$ be $m$-by-$n$, with SVD $A = U \Sigma V^T$.
Compute the SVDs of the following matrices in terms of $U$, $\Sigma$, and $V$:


1. $(A^T A)^{-1}$,
2. $(A^T A)^{-1} A^T$,
3. $A(A^T A)^{-1}$,
4. $A(A^T A)^{-1} A^T$.


QUESTION 3.10. (Medium; R. Schreiber) Let $A_k$ be a best rank-$k$ approximation
of the matrix $A$, as defined in Part 9 of Theorem 3.3. Let $\sigma_i$ be the $i$th
singular value of $A$. Show that $A_k$ is unique if $\sigma_k > \sigma_{k+1}$.


QUESTION 3.11. (Easy; Z. Bai) Let $A$ be $m$-by-$n$. Show that $X = A^+$ (the
Moore-Penrose pseudoinverse) minimizes $\|AX - I\|_F$ over all $n$-by-$m$ matrices
$X$. What is the value of this minimum?


QUESTION 3.12. (Medium; Z. Bai) Let $A$, $B$, and $C$ be matrices with
dimensions such that the product $A^T C B^T$ is well defined. Let $\mathcal{X}$ be the set of
matrices $X$ minimizing $\|AXB - C\|_F$, and let $X_0$ be the unique member of $\mathcal{X}$
minimizing $\|X\|_F$. Show that $X_0 = A^+ C B^+$. Hint: Use the SVDs of $A$ and
$B$.


QUESTION 3.13. (Medium; Z. Bai) Show that the Moore-Penrose pseudoinverse
of $A$ satisfies the following identities:

$$\begin{aligned} A A^+ A &= A, \\ A^+ A A^+ &= A^+, \\ A^+ A &= (A^+ A)^T, \\ A A^+ &= (A A^+)^T. \end{aligned}$$


QUESTION 3.14. (Medium) Prove part 4 of Theorem 3.3: Let
$H = \begin{bmatrix} 0 & A^T \\ A & 0 \end{bmatrix}$, where $A$ is square and $A = U \Sigma V^T$ is its SVD. Let $\Sigma =
\mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, $U = [u_1, \ldots, u_n]$, and $V = [v_1, \ldots, v_n]$. Prove that the $2n$
eigenvalues of $H$ are $\pm\sigma_i$, with corresponding unit eigenvectors $\frac{1}{\sqrt{2}} \begin{bmatrix} v_i \\ \pm u_i \end{bmatrix}$.
Extend to the case of rectangular $A$.


QUESTION 3.15. (Medium) Let $A$ be $m$-by-$n$, $m < n$, and of full rank. Then
$\min \|Ax - b\|_2$ is called an underdetermined least squares problem. Show that
the solution is an $(n - m)$-dimensional set. Show how to compute the unique
minimum norm solution using appropriately modified normal equations, QR
decomposition, and SVD.


QUESTION 3.16. (Medium) Prove Lemma 3.1.


QUESTION 3.17. (Hard) In section 2.6.3, we showed how to reorganize
Gaussian elimination to perform Level 2 BLAS and Level 3 BLAS at each step in
order to exploit the higher speed of these operations. In this problem, we will
show how to apply a sequence of Householder transformations using Level 2
and Level 3 BLAS.


1. Let $u_1, \ldots, u_b$ be a sequence of vectors of dimension $n$, where $\|u_i\|_2 = 1$
and the first $i - 1$ components of $u_i$ are zero. Let $P = P_b \cdot P_{b-1} \cdots P_1$,
where $P_i = I - 2u_i u_i^T$ is a Householder transformation. Show that there
is a $b$-by-$b$ lower triangular matrix $T$ such that $P = I - UTU^T$, where
$U = [u_1, \ldots, u_b]$. In particular, provide an algorithm for computing the
entries of $T$. This identity shows that we can replace multiplication by $b$
Householder transformations $P_1$ through $P_b$ by three matrix multiplications
by $U$, $T$, and $U^T$ (plus the cost of computing $T$).


2. Let House(x) be a function of the vector $x$ which returns a unit vector $u$
such that $(I - 2uu^T)x = \|x\|_2 e_1$; we showed how to implement House(x)
in section 3.4. Then Algorithm 3.2 for computing the QR decomposition
of the $m$-by-$n$ matrix $A$ may be written as


    for i = 1 : n
        u_i = House(A(i : m, i))
        P_i = I - 2 u_i u_i^T
        A(i : m, i : n) = P_i A(i : m, i : n)
    endfor


Show how to implement this in terms of the Level 2 BLAS in an efficient 
way (in particular, matrix-vector multiplications and rank-1 updates). 
What is the floating point operation count? (Just the high-order terms 
in n and m are enough.) It is sufficient to write a short program in the 
same notation as above (although trying it in Matlab and comparing 
with Matlab’s own QR factorization are a good way to make sure that 
you are right!). 


3. Using the results of step (1), show how to implement QR decomposition 
in terms of Level 3 BLAS. What is the operation count? This technique is 
used to accelerate the QR decomposition, just as we accelerated Gaussian 
elimination in section 2.6. It is used in the LAPACK routine sgeqrf. 


QUESTION 3.18. (Medium) It is often of interest to solve constrained least
squares problems, where the solution $x$ must satisfy a linear or nonlinear
constraint in addition to minimizing $\|Ax - b\|_2$. We consider one such problem
here. Suppose that we want to choose $x$ to minimize $\|Ax - b\|_2$ subject to
the linear constraint $Cx = d$. Suppose also that $A$ is $m$-by-$n$, $C$ is $p$-by-$n$,
and $C$ has full rank. We also assume that $p \le n$ (so $Cx = d$ is guaranteed to
be consistent) and $n \le m + p$ (so the system is not underdetermined). Show
that there is a unique solution under the assumption that $\begin{bmatrix} A \\ C \end{bmatrix}$ has full column
rank. Show how to compute $x$ using two QR decompositions and some
matrix-vector multiplications and solving some triangular systems of equations. Hint:
Look at LAPACK routine sgglse and its description in the LAPACK manual
[10] (NETLIB/lapack/lug/lapack_lug.html).


QUESTION 3.19. (Hard; Programming) Write a program (in Matlab or any
other language) to update a geodetic database using least squares, as described
in Example 3.3. Take as input a set of "landmarks," their approximate
coordinates $(x_i, y_i)$, and a set of new angle measurements $\theta_j$ and distance
measurements $L_{ij}$. The output should be corrections $(\delta x_i, \delta y_i)$ for each landmark, an
error bound for the corrections, and a picture (triangulation) of the old and
new landmarks.


QUESTION 3.20. (Hard) Prove Theorem 3.4. 


QUESTION 3.21. (Medium) Redo Example 3.1, using a rank-deficient least 
squares technique from section 3.5.1. Does this improve the accuracy of the 
high-degree approximating polynomials? 


4


Nonsymmetric Eigenvalue Problems 


4.1. Introduction 


We discuss canonical forms (in section 4.2), perturbation theory (in section 4.3), 
and algorithms for the eigenvalue problem for a single nonsymmetric matrix 
A (in section 4.4). Chapter 5 is devoted to the special case of real symmet- 
ric matrices A = AT (and the SVD). Section 4.5 discusses generalizations 
to eigenvalue problems involving more than one matrix, including motivating 
applications from the analysis of vibrating systems, the solution of linear differ- 
ential equations, and computational geometry. Finally, section 4.6 summarizes 
all the canonical forms, algorithms, costs, applications, and available software 
in a list. 

One can roughly divide the algorithms for the eigenproblem into two groups: 
direct methods and iterative methods. This chapter considers only direct meth- 
ods, which are intended to compute all of the eigenvalues, and (optionally) 
eigenvectors. Direct methods are typically used on dense matrices and cost 
$O(n^3)$ operations to compute all eigenvalues and eigenvectors; this cost is
relatively insensitive to the actual matrix entries.

The main direct method used in practice is QR iteration with implicit shifts 
(see section 4.4.8). It is interesting that after more than 30 years of depend- 
able service, convergence failures of this algorithm have quite recently been 
observed, analyzed, and patched [25, 64]. But there is still no global conver- 
gence proof, even though the current algorithm is considered quite reliable. So 
the problem of devising an algorithm that is numerically stable and globally 
(and quickly!) convergent remains open. 

Iterative methods, which are discussed in Chapter 7, are usually applied 
to sparse matrices or matrices for which matrix-vector multiplication is the 
only convenient operation to perform. Iterative methods typically provide 
approximations only to a subset of the eigenvalues and eigenvectors and are 
usually run only long enough to get a few adequately accurate eigenvalues 
rather than a large number. Their convergence properties depend strongly on
the matrix entries.


4.2. Canonical Forms


DEFINITION 4.1. The polynomial $p(\lambda) = \det(A - \lambda I)$ is called the
characteristic polynomial of $A$. The roots of $p(\lambda) = 0$ are the eigenvalues of $A$.


Since the degree of the characteristic polynomial $p(\lambda)$ equals $n$, the
dimension of $A$, it has $n$ roots, so $A$ has $n$ eigenvalues.


DEFINITION 4.2. A nonzero vector $x$ satisfying $Ax = \lambda x$ is a (right)
eigenvector for the eigenvalue $\lambda$. A nonzero vector $y$ such that $y^* A = \lambda y^*$ is a left
eigenvector. (Recall that $y^* = (\bar{y})^T$ is the conjugate transpose of $y$.)


Most of our algorithms will involve transforming the matrix A into sim- 
pler, or canonical forms, from which it is easy to compute its eigenvalues and 
eigenvectors. These transformations are called similarity transformations (see 
below). The two most common canonical forms are called the Jordan form 
and Schur form. The Jordan form is useful theoretically but is very hard to 
compute in a numerically stable fashion, which is why our algorithms will aim 
to compute the Schur form instead. 

To motivate Jordan and Schur forms, let us ask which matrices have the 
property that their eigenvalues are easy to compute. The easiest case would be 
a diagonal matrix, whose eigenvalues are simply its diagonal entries. Equally 
easy would be a triangular matrix, whose eigenvalues are also its diagonal 
entries. Below we will see that a matrix in Jordan or Schur form is triangular. 
But recall that a real matrix can have complex eigenvalues, since the roots 
of its characteristic polynomial may be real or complex. Therefore, there is 
not always a real triangular matrix with the same eigenvalues as a real general 
matrix, since a real triangular matrix can only have real eigenvalues. Therefore, 
we must either use complex numbers or look beyond real triangular matrices 
for our canonical forms for real matrices. It will turn out to be sufficient to 
consider block triangular matrices, i.e., matrices of the form

$$A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1b} \\ & A_{22} & \cdots & A_{2b} \\ & & \ddots & \vdots \\ & & & A_{bb} \end{bmatrix}, \qquad (4.1)$$

where each $A_{ii}$ is square and all entries below the $A_{ii}$ blocks are zero. One can
easily show that the characteristic polynomial $\det(A - \lambda I)$ of $A$ is the product
$\prod_{i=1}^{b} \det(A_{ii} - \lambda I)$ of the characteristic polynomials of the $A_{ii}$ and therefore
that the set $\lambda(A)$ of eigenvalues of $A$ is the union $\bigcup_{i=1}^{b} \lambda(A_{ii})$ of the sets of
eigenvalues of the diagonal blocks $A_{ii}$ (see Question 4.1). The canonical forms
that we compute will be block triangular and will proceed computationally by
breaking up large diagonal blocks into smaller ones. If we start with a complex
matrix $A$, the final diagonal blocks will be 1-by-1, so the ultimate canonical
form will be triangular. If we start with a real matrix $A$, the ultimate canonical
form will have 1-by-1 diagonal blocks (corresponding to real eigenvalues) and
2-by-2 diagonal blocks (corresponding to complex conjugate pairs of eigenvalues);
such a block triangular matrix is called quasi-triangular.

It is also easy to find the eigenvectors of a (block) triangular matrix; see 
section 4.2.1. 
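As a small numerical check of the statement that the eigenvalues of a block triangular matrix are the union of the eigenvalues of its diagonal blocks, consider the following Python/NumPy sketch. The particular blocks are made up for illustration; the second diagonal block is a real 2-by-2 matrix with a complex conjugate pair of eigenvalues, so the assembled matrix is quasi-triangular. The helper `sort_key` is our own and only makes the two lists comparable despite roundoff.

```python
import numpy as np

# Diagonal blocks and coupling block chosen for illustration (not from the text).
A11 = np.array([[2.0, 1.0],
                [0.0, 3.0]])             # eigenvalues 2 and 3
A22 = np.array([[0.0, -1.0],
                [1.0,  0.0]])            # eigenvalues +i and -i
A12 = np.array([[5.0, -7.0],
                [4.0,  6.0]])            # arbitrary; it does not affect the eigenvalues

A = np.block([[A11, A12],
              [np.zeros((2, 2)), A22]])

def sort_key(z):
    # Round before sorting so that roundoff-sized real parts cannot reorder pairs.
    return (round(float(np.real(z)), 8), round(float(np.imag(z)), 8))

eig_A = sorted(np.linalg.eigvals(A), key=sort_key)
eig_blocks = sorted(list(np.linalg.eigvals(A11)) + list(np.linalg.eigvals(A22)),
                    key=sort_key)
print(eig_A)
print(eig_blocks)
```

Changing the entries of `A12` should leave both printed lists unchanged, up to roundoff.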


DEFINITION 4.3. Let $S$ be any nonsingular matrix. Then $A$ and $B = S^{-1}AS$
are called similar matrices, and $S$ is a similarity transformation.


PROPOSITION 4.1. Let $B = S^{-1}AS$, so $A$ and $B$ are similar. Then $A$ and $B$
have the same eigenvalues, and $x$ (or $y$) is a right (or left) eigenvector of $A$ if
and only if $S^{-1}x$ (or $S^* y$) is a right (or left) eigenvector of $B$.


Proof. Using the fact that $\det(X \cdot Y) = \det(X) \cdot \det(Y)$ for any square matrices
$X$ and $Y$, we can write $\det(A - \lambda I) = \det(S^{-1}(A - \lambda I)S) = \det(B - \lambda I)$, so
$A$ and $B$ have the same characteristic polynomials. $Ax = \lambda x$ holds if and only
if $S^{-1}AS S^{-1}x = \lambda S^{-1}x$ or $B(S^{-1}x) = \lambda(S^{-1}x)$. Similarly, $y^* A = \lambda y^*$ if and
only if $y^* S S^{-1} A S = \lambda y^* S$ or $(S^* y)^* B = \lambda (S^* y)^*$.
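Proposition 4.1 can be checked numerically on a small example. The following Python/NumPy sketch uses a 2-by-2 matrix $A$ with eigenvalues 2 and 5 and a simple nonsingular $S$; both matrices, and the eigenvectors written down for the eigenvalue 5, are our own illustrative choices.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])               # eigenvalues 2 and 5
S = np.array([[1.0, 2.0],
              [0.0, 1.0]])               # any nonsingular matrix would do
S_inv = np.linalg.inv(S)
B = S_inv @ A @ S                        # B is similar to A

lam = 5.0
x = np.array([1.0, 1.0])                 # right eigenvector of A: A x = 5 x
y = np.array([2.0, 1.0])                 # left eigenvector of A: y^T A = 5 y^T

x_B = S_inv @ x                          # predicted right eigenvector of B
y_B = S.T @ y                            # predicted left eigenvector (S* = S^T for real S)

print(np.sort(np.linalg.eigvals(A)), np.sort(np.linalg.eigvals(B)))
print(B @ x_B - lam * x_B)               # residual of the right eigenvector
print(y_B @ B - lam * y_B)               # residual of the left eigenvector
```

Both residuals should vanish up to roundoff, and the two sorted eigenvalue lists should agree.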


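THEOREM 4.1. Jordan canonical form. Given $A$, there exists a nonsingular $S$
such that $S^{-1}AS = J$, where $J$ is in Jordan canonical form. This means that
$J$ is block diagonal, with $J = \mathrm{diag}(J_{n_1}(\lambda_1), J_{n_2}(\lambda_2), \ldots, J_{n_k}(\lambda_k))$ and

$$J_{n_i}(\lambda_i) = \begin{bmatrix} \lambda_i & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{bmatrix}.$$

$J$ is unique, up to permutations of its diagonal blocks.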


For a proof of this theorem, see a book on linear algebra such as [108] or 
[137]. 

Each $J_m(\lambda)$ is called a Jordan block with eigenvalue $\lambda$ of algebraic
multiplicity $m$. If some $n_i = 1$, and $\lambda_i$ is an eigenvalue of only that one Jordan
block, then $\lambda_i$ is called a simple eigenvalue. If all $n_i = 1$, so that $J$ is diagonal,
$A$ is called diagonalizable; otherwise it is called defective. An $n$-by-$n$ defective
matrix does not have $n$ eigenvectors, as described in more detail in the next
proposition. Although defective matrices are "rare" in a certain well-defined
sense, the fact that some matrices do not have $n$ eigenvectors is a fundamental
fact confronting anyone designing algorithms to compute eigenvectors and
eigenvalues. In section 4.3, we will see some of the difficulties that such
matrices cause. Symmetric matrices, discussed in Chapter 5, are never defective.


Fig. 4.1. Damped, vibrating mass-spring system.

$x_i$ = position of $i$-th mass (0 = equilibrium)
$m_i$ = mass of $i$-th mass
$k_i$ = spring constant of $i$-th spring
$b_i$ = damping constant of $i$-th damper


PROPOSITION 4.2. A Jordan block has one right eigenvector, $e_1 = [1, 0, \ldots, 0]^T$,
and one left eigenvector, $e_n = [0, \ldots, 0, 1]^T$. Therefore, a matrix has $n$
eigenvectors matching its $n$ eigenvalues if and only if it is diagonalizable. In this
case, $S^{-1}AS = \mathrm{diag}(\lambda_i)$. This is equivalent to $AS = S\,\mathrm{diag}(\lambda_i)$, so the $i$th
column of $S$ is a right eigenvector for $\lambda_i$. It is also equivalent to $S^{-1}A =
\mathrm{diag}(\lambda_i)S^{-1}$, so the conjugate transpose of the $i$th row of $S^{-1}$ is a left
eigenvector for $\lambda_i$. If all $n$ eigenvalues of a matrix $A$ are distinct, then $A$ is
diagonalizable.


Proof. Let $J = J_m(\lambda)$ for ease of notation. It is easy to see $Je_1 = \lambda e_1$ and
$e_n^T J = \lambda e_n^T$, so $e_1$ and $e_n$ are right and left eigenvectors of $J$, respectively. To
see that $J$ has only one right eigenvector (up to scalar multiples), note that
any eigenvector $x$ must satisfy $(J - \lambda I)x = 0$, so $x$ is in the null space of

$$J - \lambda I = \begin{bmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & 0 \end{bmatrix}.$$

But the null space of $J - \lambda I$ is clearly $\mathrm{span}(e_1)$, so there is just one eigenvector.
If all eigenvalues of $A$ are distinct, then all its Jordan blocks must be 1-by-1,
so $J = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is diagonal.
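The claim that a Jordan block has a single right eigenvector (up to scaling) can be verified numerically on one instance. In this Python/NumPy sketch the helper `jordan_block` and the values $\lambda = 3$, $m = 4$ are our own illustrative choices; the dimension of the null space of $J - \lambda I$ is obtained from the numerical rank.

```python
import numpy as np

def jordan_block(lam, m):
    """m-by-m Jordan block: lam on the diagonal, ones on the superdiagonal."""
    return lam * np.eye(m) + np.diag(np.ones(m - 1), k=1)

lam, m = 3.0, 4
J = jordan_block(lam, m)
N = J - lam * np.eye(m)                  # zero diagonal, ones on the superdiagonal

null_dim = m - np.linalg.matrix_rank(N)  # dimension of the eigenspace of J
e1 = np.eye(m)[:, 0]
en = np.eye(m)[:, -1]

print("dimension of the null space of J - lam*I:", null_dim)
print("J e1 - lam e1      =", J @ e1 - lam * e1)
print("en^T J - lam en^T  =", en @ J - lam * en)
```

The null space has dimension one, and both residuals are zero, matching the right eigenvector $e_1$ and the left eigenvector $e_n$ of the proposition.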


EXAMPLE 4.1. We illustrate the concepts of eigenvalue and eigenvector with a 
problem of mechanical vibrations. We will see a defective matrix arise in a nat- 
ural physical context. Consider the damped mass spring system in Figure 4.1, 
which we will use to illustrate a variety of eigenvalue problems. 

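Newton's law $F = ma$ applied to this system yields

$$\begin{aligned} m_i \ddot{x}_i(t) = {} & k_i (x_{i-1}(t) - x_i(t)) && \text{(force on mass $i$ from spring $i$)} \\ & + k_{i+1} (x_{i+1}(t) - x_i(t)) && \text{(force on mass $i$ from spring $i+1$)} \\ & - b_i \dot{x}_i(t) && \text{(force on mass $i$ from damper $i$)} \end{aligned} \qquad (4.2)$$

or

$$M \ddot{x}(t) = -B \dot{x}(t) - K x(t), \qquad (4.3)$$

where $M = \mathrm{diag}(m_1, \ldots, m_n)$, $B = \mathrm{diag}(b_1, \ldots, b_n)$, and

$$K = \begin{bmatrix} k_1 + k_2 & -k_2 & & & \\ -k_2 & k_2 + k_3 & -k_3 & & \\ & \ddots & \ddots & \ddots & \\ & & -k_{n-1} & k_{n-1} + k_n & -k_n \\ & & & -k_n & k_n \end{bmatrix}.$$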

We assume that all the masses $m_i$ are positive. $M$ is called the mass matrix,
$B$ is the damping matrix, and $K$ is the stiffness matrix.

Electrical engineers analyzing linear circuits arrive at an analogous equation
by applying Kirchhoff's and related laws instead of Newton's law. In this case $x$
represents branch currents, $M$ represents inductances, $B$ represents resistances,
and $K$ represents admittances (reciprocal capacitances).

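We will use a standard trick to change this second-order differential
equation to a first-order differential equation, changing variables to

$$y(t) = \begin{bmatrix} \dot{x}(t) \\ x(t) \end{bmatrix}.$$

This yields

$$\begin{aligned} \dot{y}(t) = \begin{bmatrix} \ddot{x}(t) \\ \dot{x}(t) \end{bmatrix} &= \begin{bmatrix} -M^{-1}B\dot{x}(t) - M^{-1}Kx(t) \\ \dot{x}(t) \end{bmatrix} \\ &= \begin{bmatrix} -M^{-1}B & -M^{-1}K \\ I & 0 \end{bmatrix} \cdot \begin{bmatrix} \dot{x}(t) \\ x(t) \end{bmatrix} \\ &= \begin{bmatrix} -M^{-1}B & -M^{-1}K \\ I & 0 \end{bmatrix} \cdot y(t) \equiv Ay(t). \end{aligned} \qquad (4.4)$$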

To solve y(t) = Ay(t), we assume that y(0) is given (i.e., the initial positions 
x(0) and velocities (0) are given). 

One way to write down the solution of this differential equation is y(t) = 
e“'y(0), where e^t is the matrix exponential. We will give another more el- 
ementary solution in the special case where A is diagonalizable; this will be 
true for almost all choices of m;, ki, and b;. We will return to consider other 
situations later. (The general problem of computing matrix functions such as 
e^t is discussed further in section 4.5.1 and Question 4.4.) 

When $A$ is diagonalizable, we can write $A = S\Lambda S^{-1}$, where $\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_n)$.
Then $\dot{y}(t) = Ay(t)$ is equivalent to $\dot{y}(t) = S\Lambda S^{-1}y(t)$ or $S^{-1}\dot{y}(t) = \Lambda S^{-1}y(t)$
or $\dot{z}(t) = \Lambda z(t)$, where $z(t) \equiv S^{-1}y(t)$. This diagonal system of differential
equations $\dot{z}_i(t) = \lambda_i z_i(t)$ has solutions $z_i(t) = e^{\lambda_i t}z_i(0)$, so $y(t) =
S\,\mathrm{diag}(e^{\lambda_1 t},\ldots,e^{\lambda_n t})S^{-1}y(0) = Se^{\Lambda t}S^{-1}y(0)$. A sample numerical solution for
four masses and springs is shown in Figure 4.2.

[Figure 4.2: two panels, Positions and Velocities, each plotted against Time.]

Fig. 4.2. Positions and velocities of a mass-spring system with four masses $m_1 =
m_4 = 2$ and $m_2 = m_3 = 1$. The spring constants are all $k_i = 1$. The damping
constants are all $b_i = .4$. The initial displacements are $x_1(0) = -.25$, $x_2(0) = x_3(0) =
0$, and $x_4(0) = .25$. The initial velocities are $v_1(0) = -1$, $v_2(0) = v_3(0) = 0$, and
$v_4(0) = 1$. The equilibrium positions are 1, 2, 3, and 4. The software for solving and
plotting an arbitrary mass-spring system is HOMEPAGE/Matlab/massspring.m.
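
The diagonalization formula $y(t) = Se^{\Lambda t}S^{-1}y(0)$ can be tried out on a small case. The following is a minimal NumPy sketch, not the massspring.m code referred to in the caption; the function name `solve_linear_ode` and the single-mass parameter values are assumptions chosen for illustration, and the approach assumes $A$ is diagonalizable with a reasonably well-conditioned $S$.

```python
import numpy as np

def solve_linear_ode(A, y0, t):
    """Return y(t) for y' = A y, y(0) = y0, assuming A is diagonalizable."""
    lam, S = np.linalg.eig(A)          # A = S diag(lam) S^{-1}
    z0 = np.linalg.solve(S, y0)        # z(0) = S^{-1} y(0)
    zt = np.exp(lam * t) * z0          # z_i(t) = exp(lam_i t) z_i(0)
    return (S @ zt).real               # imaginary parts cancel for real A and y0

# One mass, damper, and spring (illustrative values): y = [velocity, position].
m, b, k = 1.0, 0.4, 1.0
A = np.array([[-b / m, -k / m],
              [1.0, 0.0]])
y0 = np.array([0.0, 1.0])              # released from rest at displacement 1
for t in (0.0, 5.0, 10.0):
    print(t, solve_linear_ode(A, y0, t))
```

Solving with `S` rather than forming `inv(S)` is a small design choice; it does not remove the sensitivity to an ill-conditioned $S$.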

To see the physical significance of the nondiagonalizability of $A$ for a
mass-spring system, consider the case of a single mass, spring, and damper,
whose differential equation we can simplify to $m\ddot{x}(t) = -b\dot{x}(t) - kx(t)$, and so
$A = \begin{bmatrix} -b/m & -k/m \\ 1 & 0 \end{bmatrix}$. The two eigenvalues of $A$ are
$\lambda_\pm = \frac{b}{2m}\left(-1 \pm \left(1 - \frac{4km}{b^2}\right)^{1/2}\right)$.
When $\frac{4km}{b^2} < 1$, the system is overdamped, and there are two negative real
eigenvalues, whose mean value is $-\frac{b}{2m}$. In this case the solution eventually
decays monotonically to zero. When $\frac{4km}{b^2} > 1$, the system is underdamped,
and there are two complex conjugate eigenvalues with real part $-\frac{b}{2m}$. In this
case the solution oscillates while decaying to zero. In both cases the system is
diagonalizable since the eigenvalues are distinct. When $\frac{4km}{b^2} = 1$, the system
is critically damped, there are two real eigenvalues equal to $-\frac{b}{2m}$, and $A$ has
a single 2-by-2 Jordan block with this eigenvalue. In other words, the
nondiagonalizable matrices form the "boundary" between two physical behaviors:
oscillation and monotonic decay.
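
The three damping regimes can be checked numerically. This is a small sketch under assumed values $m = k = 1$ and $b \in \{3, 2, 1\}$; the helper name `damper_eigenvalues` is hypothetical.

```python
import numpy as np

def damper_eigenvalues(m, b, k):
    """Eigenvalues of A = [[-b/m, -k/m], [1, 0]] for one mass, damper, and spring."""
    A = np.array([[-b / m, -k / m],
                  [1.0, 0.0]])
    return np.linalg.eigvals(A)

m, k = 1.0, 1.0
for b in (3.0, 2.0, 1.0):              # overdamped, critically damped, underdamped
    ratio = 4 * k * m / b**2
    print(f"b = {b}: 4km/b^2 = {ratio:.3f}, eigenvalues = {damper_eigenvalues(m, b, k)}")
```

In the critically damped case the computed pair may differ slightly from the exact double eigenvalue $-b/2m$, which is consistent with the sensitivity of multiple eigenvalues discussed in section 4.3.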


When $A$ is diagonalizable but $S$ is an ill-conditioned matrix, so that $S^{-1}$ is
difficult to evaluate accurately, the explicit solution $y(t) = Se^{\Lambda t}S^{-1}y(0)$ will be
quite inaccurate and useless numerically. We will use this mechanical system
as a running example because it illustrates so many eigenproblems. $\diamond$


To continue our discussion of canonical forms, it is convenient to define the 
following generalization of an eigenvector. 


DEFINITION 4.4. An invariant subspace of $A$ is a subspace $\mathbf{X}$ of $\mathbb{R}^n$, with the
property that $x \in \mathbf{X}$ implies that $Ax \in \mathbf{X}$. We also write this as $A\mathbf{X} \subseteq \mathbf{X}$.


The simplest, one-dimensional invariant subspace is the set $\mathrm{span}(x)$ of all
scalar multiples of an eigenvector $x$. Here is the analogous way to build an
invariant subspace of larger dimension. Let $X = [x_1,\ldots,x_m]$, where $x_1,\ldots,x_m$
are any set of independent eigenvectors with eigenvalues $\lambda_1,\ldots,\lambda_m$. Then
$\mathbf{X} = \mathrm{span}(X)$ is an invariant subspace since $x \in \mathbf{X}$ implies $x = \sum_{i=1}^m \alpha_i x_i$ for
some scalars $\alpha_i$, so $Ax = \sum_{i=1}^m \alpha_i Ax_i = \sum_{i=1}^m \alpha_i\lambda_i x_i \in \mathbf{X}$. $A\mathbf{X}$ will equal $\mathbf{X}$
unless some eigenvalue $\lambda_i$ equals zero. The next proposition generalizes this.


PROPOSITION 4.3. Let $A$ be n-by-n, let $X = [x_1,\ldots,x_m]$ be any n-by-m matrix
with independent columns, and let $\mathbf{X} = \mathrm{span}(X)$, the m-dimensional space
spanned by the columns of $X$. Then $\mathbf{X}$ is an invariant subspace if and only
if there is an m-by-m matrix $B$ such that $AX = XB$. In this case the $m$
eigenvalues of $B$ are also eigenvalues of $A$. (When $m = 1$, $X = [x_1]$ is an
eigenvector and $B$ is an eigenvalue.)


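Proof. Assume first that $\mathbf{X}$ is invariant. Then each $Ax_i$ is also in $\mathbf{X}$, so each
$Ax_i$ must be a linear combination of a basis of $\mathbf{X}$, say, $Ax_i = \sum_{j=1}^m x_j b_{ji}$. This
last equation is equivalent to $AX = XB$. Conversely, $AX = XB$ means that
each $Ax_i$ is a linear combination of columns of $X$, so $\mathbf{X}$ is invariant.

Now assume $AX = XB$. Choose any n-by-(n − m) matrix $\tilde{X}$ such that
$\hat{X} = [X, \tilde{X}]$ is nonsingular. Then $A$ and $\hat{X}^{-1}A\hat{X}$ are similar and so have
the same eigenvalues. Write $\hat{X}^{-1} = \begin{bmatrix} Y^{m \times n} \\ \tilde{Y}^{(n-m) \times n} \end{bmatrix}$, so $\hat{X}^{-1}\hat{X} = I$ implies
$YX = I$ and $\tilde{Y}X = 0$. Then

$$ \hat{X}^{-1}A\hat{X} = \begin{bmatrix} Y \\ \tilde{Y} \end{bmatrix} \cdot [AX, A\tilde{X}]
= \begin{bmatrix} YAX & YA\tilde{X} \\ \tilde{Y}AX & \tilde{Y}A\tilde{X} \end{bmatrix}
= \begin{bmatrix} YXB & YA\tilde{X} \\ \tilde{Y}XB & \tilde{Y}A\tilde{X} \end{bmatrix}
= \begin{bmatrix} B & YA\tilde{X} \\ 0 & \tilde{Y}A\tilde{X} \end{bmatrix}. $$

Thus by Question 4.1 the eigenvalues of $A$ are
the union of the eigenvalues of $B$ and the eigenvalues of $\tilde{Y}A\tilde{X}$.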


For example, write the Jordan canonical form $S^{-1}AS = J =
\mathrm{diag}(J_{n_i}(\lambda_i))$ as $AS = SJ$, where $S = [S_1, S_2, \ldots, S_k]$ and $S_i$ has $n_i$ columns
(the same as $J_{n_i}(\lambda_i)$). Then $AS = SJ$ implies $AS_i = S_iJ_{n_i}(\lambda_i)$, i.e., $\mathrm{span}(S_i)$
is an invariant subspace.

The Jordan form tells everything that we might want to know about a
matrix and its eigenvalues, eigenvectors, and invariant subspaces. There are
also explicit formulas based on the Jordan form to compute $e^A$ or any other
function of a matrix (see section 4.5.1). But it is bad to compute the Jordan
form for two numerical reasons:

First Reason: It is a discontinuous function of $A$, so any rounding error
can change it completely.


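EXAMPLE 4.2. Let

$$ J_n(0) = \begin{bmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & 0 \end{bmatrix}, $$

which is in Jordan form. For arbitrarily small $\epsilon$, adding $i \cdot \epsilon$ to the $(i,i)$ entry
changes the eigenvalues to the $n$ distinct values $i \cdot \epsilon$, and so the Jordan form
changes from $J_n(0)$ to $\mathrm{diag}(\epsilon, 2\epsilon, \ldots, n\epsilon)$. $\diamond$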


Second Reason: It cannot be computed stably in general. In other words,
when we have finished computing $S$ and $J$, we cannot guarantee that $S^{-1}(A +
\delta A)S = J$ for some small $\delta A$.


EXAMPLE 4.3. Suppose $S^{-1}AS = J$ exactly, where $S$ is very ill-conditioned.
($\kappa(S) = \|S\| \cdot \|S^{-1}\|$ is very large.) Suppose that we are extremely lucky and
manage to compute $S$ exactly and $J$ with just a tiny error $\delta J$ with $\|\delta J\| =
O(\epsilon)\|A\|$. How big is the backward error? In other words, how big must $\delta A$
be so that $S^{-1}(A + \delta A)S = J + \delta J$? We get $\delta A = S\,\delta J\,S^{-1}$, and all that we
can conclude is that $\|\delta A\| \le \|S\| \cdot \|\delta J\| \cdot \|S^{-1}\| = O(\epsilon)\kappa(S)\|A\|$. Thus $\|\delta A\|$
may be much larger than $\epsilon\|A\|$, which prevents backward stability. $\diamond$


So instead of computing $S^{-1}AS = J$, where $S$ can be an arbitrarily
ill-conditioned matrix, we will restrict $S$ to be orthogonal (so $\kappa_2(S) = 1$) to
guarantee stability. We cannot get a canonical form as simple as the Jordan
form any more, but we do get something almost as good.


THEOREM 4.2. Schur canonical form. Given $A$, there exists a unitary matrix
$Q$ and an upper triangular matrix $T$ such that $Q^*AQ = T$. The eigenvalues of
$A$ are the diagonal entries of $T$.


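Proof. We use induction on $n$. It is obviously true if $A$ is 1-by-1. Now let $\lambda$
be any eigenvalue and $u$ a corresponding eigenvector normalized so $\|u\|_2 = 1$.
Choose $\tilde{U}$ so $U = [u, \tilde{U}]$ is a square unitary matrix. (Note $\lambda$ and $u$ may be
complex even if $A$ is real.) Then

$$ U^* \cdot A \cdot U = \begin{bmatrix} u^* \\ \tilde{U}^* \end{bmatrix} \cdot A \cdot [u, \tilde{U}]
= \begin{bmatrix} u^*Au & u^*A\tilde{U} \\ \tilde{U}^*Au & \tilde{U}^*A\tilde{U} \end{bmatrix}. $$

Now as in the proof of Proposition 4.3, we can write $u^*Au = \lambda u^*u = \lambda$,
and $\tilde{U}^*Au = \lambda\tilde{U}^*u = 0$ so $U^*AU \equiv \begin{bmatrix} \lambda & a_{12} \\ 0 & \tilde{A}_{22} \end{bmatrix}$. By induction, there is a unitary
$P$, so $P^*\tilde{A}_{22}P = \tilde{T}$ is upper triangular. Then

$$ U^*AU = \begin{bmatrix} \lambda & a_{12} \\ 0 & P\tilde{T}P^* \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & P \end{bmatrix} \begin{bmatrix} \lambda & a_{12}P \\ 0 & \tilde{T} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & P^* \end{bmatrix}, $$

so

$$ \begin{bmatrix} 1 & 0 \\ 0 & P^* \end{bmatrix} U^*AU \begin{bmatrix} 1 & 0 \\ 0 & P \end{bmatrix}
= \begin{bmatrix} \lambda & a_{12}P \\ 0 & \tilde{T} \end{bmatrix} = T $$

is upper triangular and $Q = U\begin{bmatrix} 1 & 0 \\ 0 & P \end{bmatrix}$ is unitary as desired.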

Notice that the Schur form is not unique, because the eigenvalues may 
appear on the diagonal of T in any order. 

This introduces complex numbers even when A is real. When A is real, we 
prefer a canonical form that uses only real numbers, because it will be cheaper 
to compute. As mentioned at the beginning of this section, this means that we 
will have to sacrifice a triangular canonical form and settle for a block-triangular 
canonical form. 


THEOREM 4.3. Real Schur canonical form. If $A$ is real, there exists a real
orthogonal matrix $V$ such that $V^TAV = T$ is quasi-upper triangular. This
means that $T$ is block upper triangular with 1-by-1 and 2-by-2 blocks on the
diagonal. Its eigenvalues are the eigenvalues of its diagonal blocks. The
1-by-1 blocks correspond to real eigenvalues, and the 2-by-2 blocks to complex
conjugate pairs of eigenvalues.


Proof. We use induction as before. Let $\lambda$ be an eigenvalue. If $\lambda$ is real, it has
a real eigenvector $u$ and we proceed as in the last theorem. If $\lambda$ is complex, let
$u$ be a (necessarily) complex eigenvector, so $Au = \lambda u$. Since $A\bar{u} = \overline{Au} = \bar{\lambda}\bar{u}$, $\bar{\lambda}$
and $\bar{u}$ are also an eigenvalue/eigenvector pair. Let $u_R = \frac{1}{2}u + \frac{1}{2}\bar{u}$ be the real part
of $u$ and $u_I = \frac{1}{2i}u - \frac{1}{2i}\bar{u}$ be the imaginary part. Then $\mathrm{span}\{u_R, u_I\} = \mathrm{span}\{u, \bar{u}\}$
is a two-dimensional invariant subspace. Let $\tilde{U} = [u_R, u_I]$ and $\tilde{U} = QR$ be its
QR decomposition. Thus $\mathrm{span}\{Q\} = \mathrm{span}\{u_R, u_I\}$ is invariant. Choose $\tilde{Q}$ so that $U = [Q, \tilde{Q}]$
is real and orthogonal, and compute

$$ U^T \cdot A \cdot U = \begin{bmatrix} Q^T \\ \tilde{Q}^T \end{bmatrix} \cdot A \cdot [Q, \tilde{Q}]
= \begin{bmatrix} Q^TAQ & Q^TA\tilde{Q} \\ \tilde{Q}^TAQ & \tilde{Q}^TA\tilde{Q} \end{bmatrix}. $$

Since $Q$ spans an invariant subspace, there is a 2-by-2 matrix $B$ such that
$AQ = QB$. Now as in the proof of Proposition 4.3, we can write $Q^TAQ =
Q^TQB = B$ and $\tilde{Q}^TAQ = \tilde{Q}^TQB = 0$, so $U^TAU = \begin{bmatrix} B & Q^TA\tilde{Q} \\ 0 & \tilde{Q}^TA\tilde{Q} \end{bmatrix}$. Now apply
induction to $\tilde{Q}^TA\tilde{Q}$.

4.2.1. Computing Eigenvectors from the Schur Form


Let $Q^*AQ = T$ be the Schur form. Then if $Tx = \lambda x$, we have $AQx = QTx =
\lambda Qx$, so $Qx$ is an eigenvector of $A$. So to find eigenvectors of $A$, it suffices to
find eigenvectors of $T$.

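Suppose that $\lambda = t_{ii}$ has multiplicity 1 (i.e., it is simple). Write $(T - \lambda I)x =
0$ as

$$ 0 = \begin{bmatrix} T_{11} - \lambda I & T_{12} & T_{13} \\ 0 & 0 & T_{23} \\ 0 & 0 & T_{33} - \lambda I \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} (T_{11} - \lambda I)x_1 + T_{12}x_2 + T_{13}x_3 \\ T_{23}x_3 \\ (T_{33} - \lambda I)x_3 \end{bmatrix}, $$

where $T_{11}$ is (i − 1)-by-(i − 1), $T_{22} = \lambda$ is 1-by-1, $T_{33}$ is (n − i)-by-(n − i), and
$x$ is partitioned conformably. Since $\lambda$ is simple, both $T_{11} - \lambda I$ and $T_{33} - \lambda I$
are nonsingular, so $(T_{33} - \lambda I)x_3 = 0$ implies $x_3 = 0$. Therefore $(T_{11} - \lambda I)x_1 =
-T_{12}x_2$. Choosing (arbitrarily) $x_2 = 1$ means $x_1 = -(T_{11} - \lambda I)^{-1}T_{12}$, so

$$ x = \begin{bmatrix} (\lambda I - T_{11})^{-1}T_{12} \\ 1 \\ 0 \end{bmatrix}. $$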

In other words, we just need to solve a triangular system for $x_1$. To find a
real eigenvector from real Schur form, we get a quasi-triangular system to solve.
Computing complex eigenvectors from real Schur form using only real
arithmetic also just involves equation solving but is a little trickier. See subroutine
strevc in LAPACK for details.
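
The partitioned solve above can be sketched in a few lines of NumPy. This is an illustration for an upper triangular $T$ with distinct diagonal entries, not the strevc routine; the function name `eigvec_from_triangular` and the 3-by-3 test matrix are assumptions. For brevity it calls the general solver `np.linalg.solve`, although the system is triangular and a dedicated triangular solver would normally be used.

```python
import numpy as np

def eigvec_from_triangular(T, i):
    """Eigenvector of upper triangular T for the simple eigenvalue T[i, i]."""
    n = T.shape[0]
    lam = T[i, i]
    x = np.zeros(n, dtype=T.dtype)     # the trailing block x_3 stays zero
    x[i] = 1.0                         # x_2 = 1 (arbitrary scaling)
    if i > 0:
        T11 = T[:i, :i]
        T12 = T[:i, i]
        # Solve (T11 - lam I) x_1 = -T12; the coefficient matrix is triangular.
        x[:i] = np.linalg.solve(T11 - lam * np.eye(i), -T12)
    return x

T = np.array([[1.0, 2.0, 3.0],
              [0.0, 4.0, 5.0],
              [0.0, 0.0, 6.0]])
for i in range(3):
    x = eigvec_from_triangular(T, i)
    print(T[i, i], x, np.linalg.norm(T @ x - T[i, i] * x))
```

Given the Schur factor $Q$, the corresponding eigenvector of $A$ would be $Qx$.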


4.3. Perturbation Theory 


In this section we will concentrate on understanding when eigenvalues are ill- 
conditioned and thus hard to compute accurately. In addition to providing 
error bounds for computed eigenvalues, we will also relate eigenvalue condition 
numbers to related quantities, including the distance to the nearest matrix 
with an infinitely ill-conditioned eigenvalue, and the condition number of the 
matrix of eigenvectors. 

We begin our study by asking when eigenvalues have infinite condition 
numbers. This is the case for multiple eigenvalues, as the following example 
illustrates. 


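EXAMPLE 4.4. Let

$$ A = \begin{bmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ \epsilon & & & 0 \end{bmatrix} $$

be an n-by-n matrix. Then $A$ has characteristic polynomial $\lambda^n - \epsilon = 0$,
so $\lambda = \sqrt[n]{\epsilon}$ ($n$ possible values). The $n$th root of $\epsilon$ grows much faster than
any multiple of $\epsilon$ for small $\epsilon$. For example, take $n = 16$ and $\epsilon = 10^{-16}$. Then
for each eigenvalue $|\lambda| = .1$. More formally, the condition number is infinite
because $\frac{d\lambda}{d\epsilon} = \frac{\epsilon^{\frac{1}{n} - 1}}{n} = \infty$ at $\epsilon = 0$ for $n \ge 2$. $\diamond$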

So we expect a large condition number if an eigenvalue is "close to multiple",
i.e., there is a small $\delta A$ such that $A + \delta A$ has exactly a multiple eigenvalue.
Having an infinite condition number does not mean that they cannot be
computed with any correct digits, however.


PROPOSITION 4.4. Eigenvalues of A are continuous functions of A, even if 
they are not differentiable. 


Proof. It suffices to prove the continuity of roots of polynomials, since the
coefficients of the characteristic polynomial are continuous (in fact polynomial)
functions of the matrix entries. We use the argument principle from complex
analysis [2]: the number of roots of a polynomial $p$ inside a simple closed curve
$\gamma$ is $\frac{1}{2\pi i}\oint_\gamma \frac{p'(z)}{p(z)}\,dz$. If $p$ is changed just a little, $\frac{p'(z)}{p(z)}$ is changed just a little,
so $\frac{1}{2\pi i}\oint_\gamma \frac{p'(z)}{p(z)}\,dz$ is changed just a little. But since it is an integer, it must be
constant, so the number of roots inside the curve $\gamma$ is constant. This means
that the roots cannot pass outside the curve $\gamma$ (no matter how small $\gamma$ is,
provided that we perturb $p$ by little enough), so the roots must be continuous.


In what follows, we will concentrate on computing the condition number
of a simple eigenvalue. If $\lambda$ is a simple eigenvalue of $A$ and $\delta A$ is small, then
we can identify an eigenvalue $\lambda + \delta\lambda$ of $A + \delta A$ "corresponding to" $\lambda$: it is
the closest one to $\lambda$. We can easily compute the condition number of a simple
eigenvalue.


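THEOREM 4.4. Let $\lambda$ be a simple eigenvalue of $A$ with right eigenvector $x$ and
left eigenvector $y$, normalized so that $\|x\|_2 = \|y\|_2 = 1$. Let $\lambda + \delta\lambda$ be the
corresponding eigenvalue of $A + \delta A$. Then

$$ \delta\lambda = \frac{y^*\delta A\,x}{y^*x} + O(\|\delta A\|^2) \quad \text{or} $$

$$ |\delta\lambda| \le \frac{\|\delta A\|}{|y^*x|} + O(\|\delta A\|^2) = \sec\Theta(y,x)\,\|\delta A\| + O(\|\delta A\|^2), $$

where $\Theta(y,x)$ is the acute angle between $y$ and $x$. In other words, $\sec\Theta(y,x) =
1/|y^*x|$ is the condition number of the eigenvalue $\lambda$.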


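Proof. Subtract $Ax = \lambda x$ from $(A + \delta A)(x + \delta x) = (\lambda + \delta\lambda)(x + \delta x)$ to get

$$ A\delta x + \delta A\,x + \delta A\,\delta x = \lambda\delta x + \delta\lambda\,x + \delta\lambda\,\delta x. $$

Ignore the second-order terms (those with two "$\delta$ terms" as factors: $\delta A\,\delta x$ and
$\delta\lambda\,\delta x$) and multiply by $y^*$ to get $y^*A\delta x + y^*\delta A\,x = y^*\lambda\delta x + y^*\delta\lambda\,x$.
Now $y^*A\delta x$ cancels $y^*\lambda\delta x$, so we can solve for $\delta\lambda = (y^*\delta A\,x)/(y^*x)$ as
desired.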
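
The first-order formula of Theorem 4.4 can be compared with observed eigenvalue changes. The sketch below uses an assumed 2-by-2 nonnormal matrix and an assumed small perturbation; the function name `first_order_shifts` is hypothetical. It relies on the fact that row $i$ of $S^{-1}$ equals $y_i^*/(y_i^*x_i)$ when the columns of $S$ are the unit right eigenvectors.

```python
import numpy as np

def first_order_shifts(A, dA):
    """For each eigenvalue of A return (lambda, condition number,
    first-order predicted change, observed change) under the perturbation dA."""
    lam, S = np.linalg.eig(A)              # columns of S: unit right eigenvectors x_i
    Sinv = np.linalg.inv(S)                # row i of S^{-1} is y_i^* / (y_i^* x_i)
    lam_new = np.linalg.eigvals(A + dA)
    rows = []
    for i in range(len(lam)):
        cond = np.linalg.norm(Sinv[i, :])          # equals 1/|y_i^* x_i|
        predicted = Sinv[i, :] @ dA @ S[:, i]      # y_i^* dA x_i / (y_i^* x_i)
        observed = lam_new[np.argmin(abs(lam_new - lam[i]))] - lam[i]
        rows.append((lam[i], cond, predicted, observed))
    return rows

A = np.array([[1.0, 100.0],
              [0.0, 2.0]])
dA = 1e-8 * np.array([[0.3, -0.2],
                      [0.5, 0.1]])
for row in first_order_shifts(A, dA):
    print(row)
```

For this matrix both condition numbers are about 100, so a perturbation of size $10^{-8}$ moves each eigenvalue by roughly $5 \cdot 10^{-7}$.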


Note that a Jordan block has right and left eigenvectors $e_1$ and $e_n$,
respectively, so the condition number of its eigenvalue is $1/|e_n^*e_1| = 1/0 = \infty$, which
agrees with our earlier analysis.

At the other extreme, in the important special case of symmetric matrices, 
the condition number is 1, so the eigenvalues are always accurately determined 
by the data. 


COROLLARY 4.1. Let $A$ be symmetric (or normal: $AA^* = A^*A$). Then $|\delta\lambda| \le
\|\delta A\| + O(\|\delta A\|^2)$.


Proof. If $A$ is symmetric or normal, then its eigenvectors are all orthogonal,
i.e., $Q^*AQ = \Lambda$ with $QQ^* = I$. So the right eigenvectors $x$ (columns of $Q$) and
left eigenvectors $y$ (conjugate transposes of the rows of $Q^*$) are identical, and
$1/|y^*x| = 1$.

To see a variety of numerical examples, run the Matlab code referred to in 
Question 4.14. 

Later, in Theorem 5.1, we will prove that in fact $|\delta\lambda| \le \|\delta A\|_2$ if $\delta A = \delta A^T$,
no matter how large $\|\delta A\|_2$ is.


Theorem 4.4 is only useful for sufficiently small $\|\delta A\|$. We can remove the
$O(\|\delta A\|^2)$ term and so get a simple theorem true for any size perturbation
$\|\delta A\|$, at the cost of increasing the condition number by a factor of $n$.


THEOREM 4.5. Bauer-Fike. Let $A$ have all simple eigenvalues (i.e., be
diagonalizable). Call them $\lambda_i$, with right and left eigenvectors $x_i$ and $y_i$, normalized
so $\|x_i\|_2 = \|y_i\|_2 = 1$. Then the eigenvalues of $A + \delta A$ lie in disks $B_i$, where
$B_i$ has center $\lambda_i$ and radius $n\frac{\|\delta A\|_2}{|y_i^*x_i|}$.

Our proof will use Gershgorin’s theorem (Theorem 2.9), which we repeat 
here. 


GERSHGORIN'S THEOREM. Let $B$ be an arbitrary matrix. Then the
eigenvalues $\lambda$ of $B$ are located in the union of the $n$ disks defined by $|\lambda - b_{ii}| \le
\sum_{j \ne i} |b_{ij}|$ for $i = 1$ to $n$.


We will also need two simple lemmas. 


LEMMA 4.1. Let $S = [x_1,\ldots,x_n]$, the matrix of right eigenvectors. Then

$$ S^{-1} = \begin{bmatrix} y_1^*/y_1^*x_1 \\ y_2^*/y_2^*x_2 \\ \vdots \\ y_n^*/y_n^*x_n \end{bmatrix}. $$

Proof of Lemma. We know that $AS = S\Lambda$, where $\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_n)$, since
the columns $x_i$ of $S$ are eigenvectors. This is equivalent to $S^{-1}A = \Lambda S^{-1}$, so
the rows of $S^{-1}$ are conjugate transposes of the left eigenvectors $y_i$. So

$$ S^{-1} = \begin{bmatrix} y_1^* \cdot c_1 \\ \vdots \\ y_n^* \cdot c_n \end{bmatrix} $$

for some constants $c_i$. But $I = S^{-1}S$, so $1 = (S^{-1}S)_{ii} = y_i^*x_i \cdot c_i$, and $c_i = \frac{1}{y_i^*x_i}$
as desired.


LEMMA 4.2. If each column of $S$ has two-norm equal to 1, $\|S\|_2 \le \sqrt{n}$.
Similarly, if each row of a matrix has two-norm equal to 1, its two-norm is at most
$\sqrt{n}$.


Proof of Lemma. $\|S\|_2 = \|S^T\|_2 = \max_{\|x\|_2 = 1} \|S^Tx\|_2$. Each component
of $S^Tx$ is bounded by 1 by the Cauchy-Schwarz inequality, so $\|S^Tx\|_2 \le
\|[1,\ldots,1]^T\|_2 = \sqrt{n}$.


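Proof of the Bauer-Fike theorem. We will apply Gershgorin's theorem to
$S^{-1}(A + \delta A)S = \Lambda + F$, where $\Lambda = S^{-1}AS = \mathrm{diag}(\lambda_1,\ldots,\lambda_n)$ and $F =
S^{-1}\delta A\,S$. The idea is to show that the eigenvalues of $A + \delta A$ lie in balls
centered at the $\lambda_i$ with the given radii. To do this, we take the disks containing
the eigenvalues of $\Lambda + F$ that are defined by Gershgorin's theorem,

$$ |\lambda - (\lambda_i + f_{ii})| \le \sum_{j \ne i} |f_{ij}|, $$

and enlarge them slightly to get the disks

$$ |\lambda - \lambda_i| \le \sum_j |f_{ij}|
\le n^{1/2} \cdot \Big(\sum_j |f_{ij}|^2\Big)^{1/2} \quad \text{by Cauchy-Schwarz}
= n^{1/2} \cdot \|F(i,:)\|_2. \tag{4.5} $$

Now we need to bound the two-norm of the $i$th row $F(i,:)$ of $F = S^{-1}\delta A\,S$:

$$ \|F(i,:)\|_2 = \|(S^{-1}\delta A\,S)(i,:)\|_2
\le \|(S^{-1})(i,:)\|_2 \cdot \|\delta A\|_2 \cdot \|S\|_2 \quad \text{by Lemma 1.7}
\le \frac{n^{1/2}}{|y_i^*x_i|} \cdot \|\delta A\|_2 \quad \text{by Lemmas 4.1 and 4.2.} $$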


Combined with equation (4.5), this proves the theorem.

We do not want to leave the impression that multiple eigenvalues cannot
be computed with any accuracy at all just because they have infinite condition
numbers. Indeed, we expect to get a fraction of the digits correct rather than
lose a fixed number of digits. To illustrate, consider the 2-by-2 matrix with a
double eigenvalue at 1: $A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$. If we perturb the (2,1) entry (the most
sensitive) from 0 to machine epsilon $\epsilon$, the eigenvalues change from 1 to $1 \pm \sqrt{\epsilon}$.
In other words the computed eigenvalues agree with the true eigenvalue to half
precision. More generally, with a triple root, we expect to get about one third
of the digits correct, and so on for higher multiplicities. See also Question 1.20.

We now turn to a geometric property of the condition number shared by
other problems. Recall the property of the condition number $\|A\| \cdot \|A^{-1}\|$
for matrix inversion: its reciprocal measured the distance to the nearest singular
matrix, i.e., matrix with an infinite condition number (see Theorem 2.1). An
analogous fact is true about eigenvalues. Since multiple eigenvalues have
infinite condition numbers, the set of matrices with multiple eigenvalues plays the
same role for computing eigenvalues as the singular matrices did for matrix
inversion, where being "close to singular" implied ill-conditioning.


THEOREM 4.6. Let $\lambda$ be a simple eigenvalue of $A$, with unit right and left
eigenvectors $x$ and $y$ and condition number $c = 1/|y^*x|$. Then there is a $\delta A$
such that $A + \delta A$ has a multiple eigenvalue at $\lambda$, and

$$ \frac{\|\delta A\|_2}{\|A\|_2} \le \frac{1}{\sqrt{c^2 - 1}}. $$

When $c \gg 1$, i.e., the eigenvalue is ill-conditioned, then the upper bound on
the distance is $1/\sqrt{c^2 - 1} \approx 1/c$, the reciprocal of the condition number.


Proof. First we show that we can assume without loss of generality that $A$ is
upper triangular (in Schur form), with $a_{11} = \lambda$. This is because putting $A$ in
Schur form is equivalent to replacing $A$ by $T = Q^*AQ$, where $Q$ is unitary. If
$x$ and $y$ are eigenvectors of $A$, then $Q^*x$ and $Q^*y$ are eigenvectors of $T$. Since
$(Q^*y)^*(Q^*x) = y^*QQ^*x = y^*x$, changing to Schur form does not change the
condition number of $\lambda$. (Another way to say this is that the condition number
is the secant of the angle $\Theta(x,y)$ between $x$ and $y$, and changing $x$ to $Q^*x$
and $y$ to $Q^*y$ just rotates $x$ and $y$ the same way without changing the angle
between them.)

So without loss of generality we can assume that $A = \begin{bmatrix} \lambda & A_{12} \\ 0 & A_{22} \end{bmatrix}$. Then
$x = e_1$, and $y$ is parallel to $\tilde{y} = [1, A_{12}(\lambda I - A_{22})^{-1}]^*$, or $y = \tilde{y}/\|\tilde{y}\|_2$. Thus

$$ c = \frac{1}{|y^*x|} = \frac{\|\tilde{y}\|_2}{|\tilde{y}^*x|} = \|\tilde{y}\|_2 = \left(1 + \|A_{12}(\lambda I - A_{22})^{-1}\|_2^2\right)^{1/2} $$

or

$$ \sqrt{c^2 - 1} = \|A_{12}(\lambda I - A_{22})^{-1}\|_2 \le \|A_{12}\|_2 \cdot \|(\lambda I - A_{22})^{-1}\|_2 \le \frac{\|A\|_2}{\sigma_{\min}(\lambda I - A_{22})}. $$

By definition of the smallest singular value, there is a $\delta A_{22}$ where $\|\delta A_{22}\|_2 =
\sigma_{\min}(\lambda I - A_{22})$ such that $A_{22} + \delta A_{22} - \lambda I$ is singular; i.e., $\lambda$ is an eigenvalue
of $A_{22} + \delta A_{22}$. Thus $\begin{bmatrix} \lambda & A_{12} \\ 0 & A_{22} + \delta A_{22} \end{bmatrix}$ has a double eigenvalue at $\lambda$, where

$$ \|\delta A_{22}\|_2 \le \sigma_{\min}(\lambda I - A_{22}) \le \frac{\|A\|_2}{\sqrt{c^2 - 1}} $$

as desired.

Finally, we relate the condition numbers of the eigenvalues to the smallest
possible condition number $\|S\| \cdot \|S^{-1}\|$ of any similarity $S$ that diagonalizes
$A$: $S^{-1}AS = \Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_n)$. The theorem says that if any eigenvalue
has a large condition number, then $S$ has to have an approximately equally
large condition number. In other words, the condition numbers for finding the
(worst) eigenvalue and for reducing the matrix to diagonal form are nearly the
same.


THEOREM 4.7. Let $A$ be diagonalizable with eigenvalues $\lambda_i$ and right and left
eigenvectors $x_i$ and $y_i$, respectively, normalized so $\|x_i\|_2 = \|y_i\|_2 = 1$.
Suppose that $S$ satisfies $S^{-1}AS = \Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_n)$. Then $\|S\|_2 \cdot \|S^{-1}\|_2 \ge
\max_i 1/|y_i^*x_i|$. If we choose $S = [x_1,\ldots,x_n]$, then $\|S\|_2 \cdot \|S^{-1}\|_2 \le n \cdot \max_i 1/|y_i^*x_i|$;
i.e., the condition number of $S$ is within a factor of $n$ of its smallest value.


For a proof, see [68]. 
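
The two bounds of Theorem 4.7 can be observed on a small nonnormal matrix. This sketch is an illustration only; the function name `eigenvalue_conditioning` and the 3-by-3 test matrix are assumptions, and the eigenvalue condition numbers are read off the row norms of $S^{-1}$ as in Lemma 4.1.

```python
import numpy as np

def eigenvalue_conditioning(A):
    """Return eigenvalues, their condition numbers 1/|y_i^* x_i|, and kappa_2(S)
    for S = [x_1, ..., x_n] with unit-norm columns."""
    lam, S = np.linalg.eig(A)                      # unit right eigenvectors
    Sinv = np.linalg.inv(S)
    conds = np.linalg.norm(Sinv, axis=1)           # row norms of S^{-1} (Lemma 4.1)
    kappa = np.linalg.norm(S, 2) * np.linalg.norm(Sinv, 2)
    return lam, conds, kappa

A = np.array([[1.0, 10.0, 0.0],
              [0.0, 2.0, 10.0],
              [0.0, 0.0, 3.0]])
lam, conds, kappa = eigenvalue_conditioning(A)
n = A.shape[0]
print(conds.max(), kappa, n * conds.max())         # lower bound, kappa_2(S), upper bound
```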

For an overview of condition numbers for the eigenproblem, including eigen- 
vectors, invariant subspaces, and the eigenvalues corresponding to an invariant 
subspace, see chapter 4 of the LAPACK manual [10], as well as [159, 235]. Al- 
gorithms for computing these condition numbers are available in subroutines 
strsna and strsen of LAPACK or by calling the driver routines sgeevx and 
sgeesx. 


4.4. Algorithms for the Nonsymmetric Eigenproblem 


We will build up to our ultimate algorithm, the shifted Hessenberg QR algo- 
rithm, by starting with simpler ones. For simplicity of exposition, we assume 
A is real. 

Our first and simplest algorithm is the power method (section 4.4.1), which
can find only the largest eigenvalue of $A$ in absolute value and the
corresponding eigenvector. To find the other eigenvalues and eigenvectors, we apply the
power method to $(A - \sigma I)^{-1}$ for some shift $\sigma$, an algorithm called inverse
iteration (section 4.4.2); note that the largest eigenvalue of $(A - \sigma I)^{-1}$ is $1/(\lambda_i - \sigma)$,
where $\lambda_i$ is the closest eigenvalue to $\sigma$, so we can choose which eigenvalues to
find by choosing $\sigma$. Our next improvement to the power method lets us
compute an entire invariant subspace at a time rather than just a single eigenvector;
we call this orthogonal iteration (section 4.4.3). Finally, we reorganize
orthogonal iteration to make it convenient to apply to $(A - \sigma I)^{-1}$ instead of $A$; this
is called QR iteration (section 4.4.4).

Mathematically speaking, QR iteration (with a shift $\sigma$) is our ultimate
algorithm. But several problems remain to be solved to make it sufficiently
fast and reliable for practical use (section 4.4.5). Section 4.4.6 discusses the 
first transformation designed to make QR iteration fast: reducing A from dense 
to upper Hessenberg form (nonzero only on and above the first subdiagonal). 
Subsequent sections describe how to implement QR iteration efficiently on 
upper Hessenberg matrices. (Section 4.4.7 shows how upper Hessenberg form 
simplifies in the cases of the symmetric eigenvalue problem and SVD.) 


4.4.1. Power Method 


ALGORITHM 4.1. Power method: Given $x_0$, we iterate

    $i = 0$
    repeat
        $y_{i+1} = Ax_i$
        $x_{i+1} = y_{i+1}/\|y_{i+1}\|_2$    (approximate eigenvector)
        $\tilde{\lambda}_{i+1} = x_{i+1}^TAx_{i+1}$    (approximate eigenvalue)
        $i = i + 1$
    until convergence


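Let us first apply this algorithm in the very simple case when $A = \mathrm{diag}(\lambda_1,
\ldots,\lambda_n)$, with $|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|$. In this case the eigenvectors are
just the columns $e_i$ of the identity matrix. Note that $x_i$ can also be written
$x_i = A^ix_0/\|A^ix_0\|_2$, since the factors $1/\|y_{i+1}\|_2$ only scale $x_{i+1}$ to be a unit
vector and do not change its direction. Then we get

$$ A^ix_0 \equiv A^i \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{bmatrix}
= \begin{bmatrix} \alpha_1\lambda_1^i \\ \alpha_2\lambda_2^i \\ \vdots \\ \alpha_n\lambda_n^i \end{bmatrix}
= \alpha_1\lambda_1^i \begin{bmatrix} 1 \\ \frac{\alpha_2}{\alpha_1}\left(\frac{\lambda_2}{\lambda_1}\right)^i \\ \vdots \\ \frac{\alpha_n}{\alpha_1}\left(\frac{\lambda_n}{\lambda_1}\right)^i \end{bmatrix}, $$

where we have assumed $\alpha_1 \ne 0$. Since all the fractions $\lambda_j/\lambda_1$ are less than 1
in absolute value, $A^ix_0$ becomes more and more nearly parallel to $e_1$, so $x_i =
A^ix_0/\|A^ix_0\|_2$ becomes closer and closer to $\pm e_1$, the eigenvector corresponding
to the largest eigenvalue $\lambda_1$. The rate of convergence depends on how much
smaller than 1 the ratios $|\lambda_2/\lambda_1| \ge \cdots \ge |\lambda_n/\lambda_1|$ are, the smaller the faster.
Since $x_i$ converges to $\pm e_1$, $\tilde{\lambda}_i = x_i^TAx_i$ converges to $\lambda_1$, the largest eigenvalue.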

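In showing that the power method converges, we have made several
assumptions, most notably that $A$ is diagonal. To analyze a more general case,
we now assume that $A = S\Lambda S^{-1}$ is diagonalizable, with $\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_n)$
and the eigenvalues sorted so that $|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|$. Write $S =
[s_1,\ldots,s_n]$, where the columns $s_i$ are the corresponding eigenvectors and also
satisfy $\|s_i\|_2 = 1$; in the last paragraph we had $S = I$. This lets us write
$x_0 = S(S^{-1}x_0) \equiv S([\alpha_1,\ldots,\alpha_n]^T)$. Also, since $A = S\Lambda S^{-1}$, we can write

$$ A^i = \underbrace{(S\Lambda S^{-1}) \cdots (S\Lambda S^{-1})}_{i \text{ times}} = S\Lambda^iS^{-1} $$

since all the $S^{-1} \cdot S$ pairs cancel. This finally lets us write

$$ A^ix_0 = (S\Lambda^iS^{-1})\,S \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}
= S \begin{bmatrix} \alpha_1\lambda_1^i \\ \vdots \\ \alpha_n\lambda_n^i \end{bmatrix}
= \alpha_1\lambda_1^i\,S \begin{bmatrix} 1 \\ \frac{\alpha_2}{\alpha_1}\left(\frac{\lambda_2}{\lambda_1}\right)^i \\ \vdots \\ \frac{\alpha_n}{\alpha_1}\left(\frac{\lambda_n}{\lambda_1}\right)^i \end{bmatrix}. $$

As before, the vector in brackets converges to $e_1$, so $A^ix_0$ gets closer and closer
to a multiple of $Se_1 = s_1$, the eigenvector corresponding to $\lambda_1$. Therefore,
$\tilde{\lambda}_i = x_i^TAx_i$ converges to $s_1^TAs_1 = s_1^T\lambda_1s_1 = \lambda_1$.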

A minor drawback of this method is the assumption that $\alpha_1 \ne 0$; this is true
with very high probability if $x_0$ is chosen at random. A major drawback is that
it converges to the eigenvalue/eigenvector pair only for the eigenvalue of largest
absolute magnitude, and its convergence rate depends on $|\lambda_2/\lambda_1|$, a quantity
which may be close to 1 and thus cause very slow convergence. Indeed, if $A$
is real and the largest eigenvalue is complex, there are two complex conjugate
eigenvalues of largest absolute value $|\lambda_1| = |\lambda_2|$, and so the above analysis
does not work at all. In the extreme case of an orthogonal matrix, all the
eigenvalues have the same absolute value, namely, 1.
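
Algorithm 4.1 translates almost line for line into NumPy. The sketch below replaces the convergence test with a fixed iteration count for brevity; the function name `power_method` and the 2-by-2 test matrix (eigenvalues 5 and 2, so $|\lambda_2/\lambda_1| = 0.4$) are assumptions chosen for illustration.

```python
import numpy as np

def power_method(A, x0, iters=60):
    """Algorithm 4.1 with a fixed iteration count instead of a convergence test."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)      # approximate eigenvector
        lam = x @ A @ x                # approximate eigenvalue
    return lam, x

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])             # eigenvalues 5 and 2
lam, x = power_method(A, np.array([1.0, 0.0]))
print(lam, x)
```

The starting vector $[1, 0]^T$ has a nonzero component along the dominant eigenvector $[1, 1]^T$, so the assumption $\alpha_1 \ne 0$ holds here.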


4.4.2. Inverse Iteration 


We will overcome the drawbacks of the power method just described by
applying the power method to $(A - \sigma I)^{-1}$ instead of $A$, where $\sigma$ is called a shift.
This will let us converge to the eigenvalue closest to $\sigma$, rather than just $\lambda_1$.
This method is called inverse iteration or the inverse power method.


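ALGORITHM 4.2. Inverse iteration: Given $x_0$, we iterate

    $i = 0$
    repeat
        $y_{i+1} = (A - \sigma I)^{-1}x_i$
        $x_{i+1} = y_{i+1}/\|y_{i+1}\|_2$    (approximate eigenvector)
        $\tilde{\lambda}_{i+1} = x_{i+1}^TAx_{i+1}$    (approximate eigenvalue)
        $i = i + 1$
    until convergence

To analyze the convergence, note that $A = S\Lambda S^{-1}$ implies $A - \sigma I = S(\Lambda -
\sigma I)S^{-1}$ and so $(A - \sigma I)^{-1} = S(\Lambda - \sigma I)^{-1}S^{-1}$. Thus $(A - \sigma I)^{-1}$ has the same
eigenvectors $s_i$ as $A$ with corresponding eigenvalues $((\Lambda - \sigma I)^{-1})_{jj} = (\lambda_j - \sigma)^{-1}$.
The same analysis as before tells us to expect $x_i$ to converge to the eigenvector
corresponding to the largest eigenvalue in absolute value. More specifically,
assume that $|\lambda_k - \sigma|$ is smaller than all the other $|\lambda_i - \sigma|$ so that $(\lambda_k - \sigma)^{-1}$
is the largest eigenvalue in absolute value. Also, write $x_0 = S[\alpha_1,\ldots,\alpha_n]^T$ as
before, and assume $\alpha_k \ne 0$. Then

$$ (A - \sigma I)^{-i}x_0 = (S(\Lambda - \sigma I)^{-i}S^{-1})\,S \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}
= S \begin{bmatrix} \alpha_1(\lambda_1 - \sigma)^{-i} \\ \vdots \\ \alpha_n(\lambda_n - \sigma)^{-i} \end{bmatrix}
= \alpha_k(\lambda_k - \sigma)^{-i}\,S \begin{bmatrix} \frac{\alpha_1}{\alpha_k}\left(\frac{\lambda_k - \sigma}{\lambda_1 - \sigma}\right)^i \\ \vdots \\ 1 \\ \vdots \\ \frac{\alpha_n}{\alpha_k}\left(\frac{\lambda_k - \sigma}{\lambda_n - \sigma}\right)^i \end{bmatrix}, $$

where the 1 is in entry $k$. Since all the fractions $(\lambda_k - \sigma)/(\lambda_i - \sigma)$ are less than
one in absolute value, the vector in brackets approaches $e_k$, so $(A - \sigma I)^{-i}x_0$
gets closer and closer to a multiple of $Se_k = s_k$, the eigenvector corresponding
to $\lambda_k$. As before, $\tilde{\lambda}_i = x_i^TAx_i$ also converges to $\lambda_k$.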

The advantage of inverse iteration over the power method is the ability to
converge to any desired eigenvalue (the one nearest the shift $\sigma$). By choosing $\sigma$
very close to a desired eigenvalue, we can converge very quickly and not be as
limited by the proximity of nearby eigenvalues as is the original power method.
The method is particularly effective when we have a good approximation to
an eigenvalue and want only its corresponding eigenvector (for example, see
section 5.3.4). Later we will explain how to choose such a $\sigma$ without knowing
the eigenvalues, which is what we are trying to compute in the first place!
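
Algorithm 4.2 can be sketched the same way. Rather than forming $(A - \sigma I)^{-1}$, the sketch solves a linear system at each step; in practice one would factor $A - \sigma I$ once and reuse the factorization. The function name `inverse_iteration`, the fixed iteration count, and the test values (eigenvalues 5 and 2, shift $\sigma = 1.9$) are assumptions for illustration.

```python
import numpy as np

def inverse_iteration(A, sigma, x0, iters=30):
    """Algorithm 4.2 with a fixed iteration count instead of a convergence test."""
    n = A.shape[0]
    M = A - sigma * np.eye(n)
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        y = np.linalg.solve(M, x)      # y = (A - sigma I)^{-1} x
        x = y / np.linalg.norm(y)      # approximate eigenvector
        lam = x @ A @ x                # approximate eigenvalue
    return lam, x

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])             # eigenvalues 5 and 2
lam, x = inverse_iteration(A, 1.9, np.array([1.0, 0.0]))
print(lam, x)
```

With this shift the convergence ratio is $|2 - 1.9|/|5 - 1.9| \approx 0.03$ per step, compared with 0.4 for the power method on the same matrix.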


4.4.3. Orthogonal Iteration 


Our next improvement will permit us to converge to a ($p > 1$)-dimensional
invariant subspace, rather than one eigenvector at a time. It is called orthogonal
iteration (and sometimes subspace iteration or simultaneous iteration).


ALGORITHM 4.3. Orthogonal iteration: Let $Z_0$ be an $n \times p$ orthogonal matrix.
Then we iterate

    $i = 0$
    repeat
        $Y_{i+1} = AZ_i$
        Factor $Y_{i+1} = Z_{i+1}R_{i+1}$ using Algorithm 3.2 (QR decomposition)
            ($Z_{i+1}$ spans an approximate invariant subspace)
        $i = i + 1$
    until convergence

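Here is an informal analysis of this method. Assume $|\lambda_p| > |\lambda_{p+1}|$. If $p = 1$,
this method and its analysis are identical to the power method. When $p > 1$, we
write $\mathrm{span}\{Z_{i+1}\} = \mathrm{span}\{Y_{i+1}\} = \mathrm{span}\{AZ_i\}$, so $\mathrm{span}\{Z_i\} = \mathrm{span}\{A^iZ_0\} =
\mathrm{span}\{S\Lambda^iS^{-1}Z_0\}$. Note that

$$ S\Lambda^iS^{-1}Z_0 = S\,\mathrm{diag}(\lambda_1^i,\ldots,\lambda_n^i)\,S^{-1}Z_0
= \lambda_p^i\,S \begin{bmatrix} (\lambda_1/\lambda_p)^i & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & (\lambda_n/\lambda_p)^i \end{bmatrix} S^{-1}Z_0. $$

Since $|\lambda_j/\lambda_p| \ge 1$ if $j \le p$, and $|\lambda_j/\lambda_p| < 1$ if $j > p$, we get

$$ \begin{bmatrix} (\lambda_1/\lambda_p)^i & & \\ & \ddots & \\ & & (\lambda_n/\lambda_p)^i \end{bmatrix} S^{-1}Z_0
= \begin{bmatrix} X_i^{p \times p} \\ Y_i^{(n-p) \times p} \end{bmatrix}, $$

where $Y_i$ approaches zero like $(\lambda_{p+1}/\lambda_p)^i$, and $X_i$ does not approach zero.
Indeed, if $X_0$ has full rank (a generalization of the assumption in section 4.4.1
that $\alpha_1 \ne 0$), then $X_i$ will have full rank too. Write the matrix of eigenvectors
$S = [s_1,\ldots,s_n] \equiv [S_p^{n \times p}, \hat{S}_p^{n \times (n-p)}]$, i.e., $S_p = [s_1,\ldots,s_p]$. Then $S\Lambda^iS^{-1}Z_0 =
\lambda_p^i\,S \begin{bmatrix} X_i \\ Y_i \end{bmatrix} = \lambda_p^i(S_pX_i + \hat{S}_pY_i)$. Thus

$$ \mathrm{span}(Z_i) = \mathrm{span}(S\Lambda^iS^{-1}Z_0) = \mathrm{span}(S_pX_i + \hat{S}_pY_i) $$

converges to $\mathrm{span}(S_pX_i) = \mathrm{span}(S_p)$,
the invariant subspace spanned by the first $p$ eigenvectors, as desired.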

The use of the QR decomposition keeps the vectors spanning
$\mathrm{span}\{A^iZ_0\}$ of full rank despite roundoff.

Note that if we follow only the first $\tilde{p} < p$ columns of $Z_i$ through the
iterations of the algorithm, they are identical to the columns that we would
compute if we had started with only the first $\tilde{p}$ columns of $Z_0$ instead of $p$
columns. In other words, orthogonal iteration is effectively running the
algorithm for $\tilde{p} = 1, 2, \ldots, p$ all at the same time. So if all the eigenvalues have
distinct absolute values, the same convergence analysis as before implies that
the first $\tilde{p} \le p$ columns of $Z_i$ converge to $\mathrm{span}\{s_1,\ldots,s_{\tilde{p}}\}$ for any $\tilde{p} \le p$.

Thus, we can let $p = n$ and $Z_0 = I$ in the orthogonal iteration algorithm.
The next theorem shows that under certain assumptions, we can use orthogonal
iteration to compute the Schur form of $A$.


THEOREM 4.8. Consider running orthogonal iteration on matrix $A$ with $p = n$
and $Z_0 = I$. If all the eigenvalues of $A$ have distinct absolute values and if
all the principal submatrices $S(1:j, 1:j)$ have full rank, then $A_i \equiv Z_i^TAZ_i$
converges to the Schur form of $A$, i.e., an upper triangular matrix with the
eigenvalues on the diagonal. The eigenvalues will appear in decreasing order
of absolute value.


Sketch of Proof. The assumption about nonsingularity of $S(1:j, 1:j)$ for
all $j$ implies that $X_0$ is nonsingular, as required by the earlier analysis. First
note that $Z_i$ is a square orthogonal matrix, so $A$ and $A_i = Z_i^TAZ_i$ are similar.
Write $Z_i = [Z_{1i}, Z_{2i}]$, where $Z_{1i}$ has $p$ columns, so

$$ A_i = Z_i^TAZ_i = \begin{bmatrix} Z_{1i}^TAZ_{1i} & Z_{1i}^TAZ_{2i} \\ Z_{2i}^TAZ_{1i} & Z_{2i}^TAZ_{2i} \end{bmatrix}. $$

Since $\mathrm{span}\{Z_{1i}\}$ converges to an invariant subspace of $A$, $\mathrm{span}\{AZ_{1i}\}$ converges
to the same subspace, so $Z_{2i}^TAZ_{1i}$ converges to $Z_{2i}^TZ_{1i} = 0$. Since this is true
for all $p < n$, every subdiagonal entry of $A_i$ converges to zero, so $A_i$ converges
to upper triangular form, i.e., Schur form.

In fact, this proof shows that the submatrix $Z_{2i}^TAZ_{1i} = A_i(p+1:n, 1:p)$
should converge to zero like $|\lambda_{p+1}/\lambda_p|^i$. Thus, $\lambda_p$ should appear as the $(p,p)$
entry of $A_i$ and converge like $\max(|\lambda_{p+1}/\lambda_p|^i, |\lambda_p/\lambda_{p-1}|^i)$.


EXAMPLE 4.5. The convergence behavior of orthogonal iteration is illustrated
by the following numerical experiment, where we took $\Lambda = \mathrm{diag}(1, 2, 6, 30)$ and
a random $S$ (with condition number about 20), formed $A = S \cdot \Lambda \cdot S^{-1}$, and
ran orthogonal iteration on $A$ with $p = 4$ for 19 iterations. Figures 4.3 and
4.4 show the convergence of the algorithm. Figure 4.3 plots the actual errors
$|A_i(p,p) - \lambda_p|$ in the computed eigenvalues as solid lines and the approximations
$\max(|\lambda_{p+1}/\lambda_p|^i, |\lambda_p/\lambda_{p-1}|^i)$ as dotted lines. Since the graphs are (essentially)
straight lines with the same slope on a semilog scale, this means that they are
both graphs of functions of the form $y = c \cdot r^i$, where $c$ and $r$ are constants
and $r$ (the slope) is the same for both, as we predicted above.

Similarly, Figure 4.4 plots the actual values $\|A_i(p+1:n, 1:p)\|_2$ as solid
lines and the approximations $|\lambda_{p+1}/\lambda_p|^i$ as dotted lines; they also match well.
Here are $A_0$ and $A_{19}$ for comparison:

$$ A_0 = \begin{bmatrix}
3.5488 & 15.593 & 8.5775 & [\ldots] \\
2.3595 & 24.526 & 14.596 & -5.8157 \\
[\ldots] & 27.599 & 21.483 & [\ldots] \\
1.9227 & 55.667 & 39.717 & -10.558
\end{bmatrix}, $$

$$ A_{19} = \begin{bmatrix}
30.000 & -32.557 & -70.844 & 14.984 \\
6.7607 \cdot 10^{-1?} & 6.0000 & 1.8143 & -.55754 \\
1.5452 \cdot 10^{-23} & 1.1086 \cdot 10^{-9} & 2.0000 & -.25894 \\
[\ldots] & 3.3769 \cdot 10^{-15} & 4.9533 \cdot 10^{-6} & 1.0000
\end{bmatrix}. $$

[Figure 4.3: panels include "Convergence of lambda(1) = 30".]

Fig. 4.3. Convergence of diagonal entries during orthogonal iteration.


EXAMPLE 4.6. To see why the assumption in Theorem 4.8 about nonsingularity
of $S(1:j, 1:j)$ is necessary, suppose that $A$ is diagonal with the eigenvalues
not in decreasing order. Then orthogonal iteration yields $Z_i = \mathrm{diag}(\pm 1)$ (a
diagonal matrix with diagonal entries $\pm 1$) and $A_i = A$ for all $i$, so the eigenvalues
do not move into decreasing order. To see why the assumption that the
eigenvalues have distinct absolute values is necessary, suppose that $A$ is orthogonal,
so all its eigenvalues have absolute value 1. Again, the algorithm leaves $A_i$
essentially unchanged. (The rows and columns may be multiplied by $-1$.)


4.4.4. QR Iteration 


Our next goal is to reorganize orthogonal iteration to incorporate shifting and 
inverting, as in section 4.4.2. This will make it more efficient and eliminate 
the assumption that eigenvalues differ in magnitude, which was needed in 
Theorem 4.8 to prove convergence. 

Fig. 4.4. Convergence to Schur form during orthogonal iteration. (Panel titles: Convergence of Schur form(2:4, 1:1); Convergence of Schur form(3:4, 1:2).)


ALGORITHM 4.4. QR iteration: Given $A_0$, we iterate

    i = 0
    repeat
        Factor $A_i = Q_i R_i$ (the QR decomposition)
        $A_{i+1} = R_i Q_i$
        i = i + 1
    until convergence

Since $A_{i+1} = R_i Q_i = Q_i^T (Q_i R_i) Q_i = Q_i^T A_i Q_i$, $A_{i+1}$ and $A_i$ are
orthogonally similar.

We claim that the $A_i$ computed by QR iteration is identical to the matrix
$Z_i^T A Z_i$ implicitly computed by orthogonal iteration.


LEMMA 4.3. $A_i = Z_i^T A Z_i$, where $Z_i$ is the matrix computed from orthogonal
iteration (Algorithm 4.3). Thus $A_i$ converges to Schur form if all the eigenvalues
have different absolute values.


Proof. We use induction. Assume $A_i = Z_i^T A Z_i$. From Algorithm 4.3,
we can write $A Z_i = Z_{i+1} R_{i+1}$, where $Z_{i+1}$ is orthogonal and $R_{i+1}$ is upper
triangular. Then $Z_i^T A Z_i = Z_i^T (Z_{i+1} R_{i+1})$ is the product of an orthogonal
matrix $Q = Z_i^T Z_{i+1}$ and an upper triangular matrix $R = R_{i+1} = Z_{i+1}^T A Z_i$;
this must be the QR decomposition $A_i = QR$, since the QR decomposition
is unique (except for possibly multiplying each column of $Q$ and row of $R$ by
$-1$). Then

$$Z_{i+1}^T A Z_{i+1} = (Z_{i+1}^T A Z_i)(Z_i^T Z_{i+1}) = R_{i+1}(Z_i^T Z_{i+1}) = RQ.$$

This is precisely how the QR iteration maps $A_i$ to $A_{i+1}$, so $Z_{i+1}^T A Z_{i+1} = A_{i+1}$
as desired.

To see a variety of numerical examples illustrating the convergence of QR
iteration, run the Matlab code referred to in Question 4.15.
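
Lemma 4.3 can also be checked numerically. The following is a minimal Python/NumPy sketch, not the Matlab code of Question 4.15: the matrices `S` and `Lam` are illustrative assumptions (a symmetric, diagonally dominant `S` is chosen only so that its leading principal submatrices are nonsingular), and the two iterates are compared entry by entry in absolute value because the QR decomposition is unique only up to signs.

```python
import numpy as np

def qr_iteration(A, steps):
    """Unshifted QR iteration (Algorithm 4.4): A_i = Q_i R_i, A_{i+1} = R_i Q_i."""
    Ai = A.copy()
    for _ in range(steps):
        Q, R = np.linalg.qr(Ai)
        Ai = R @ Q
    return Ai

def orthogonal_iteration(A, steps):
    """Orthogonal iteration with p = n and Z_0 = I; returns Z_i^T A Z_i."""
    Z = np.eye(A.shape[0])
    for _ in range(steps):
        Z, _ = np.linalg.qr(A @ Z)
    return Z.T @ A @ Z

# Illustrative test matrix (an assumption, not the matrix of Example 4.5).
S = np.array([[4.0, 1.0, 0.0, 1.0],
              [1.0, 3.0, 1.0, 0.0],
              [0.0, 1.0, 5.0, 2.0],
              [1.0, 0.0, 2.0, 4.0]])
Lam = np.diag([30.0, 6.0, 2.0, 1.0])
A = S @ Lam @ np.linalg.inv(S)

A_qr = qr_iteration(A, 19)
A_oi = orthogonal_iteration(A, 19)

# The two results agree up to the signs of rows and columns.
print(np.max(np.abs(np.abs(A_qr) - np.abs(A_oi))))
print(np.diag(A_qr))
```

After 19 steps the diagonal of `A_qr` should be close to the eigenvalues in decreasing order, with the slowest entries governed by the ratio $|\lambda_4/\lambda_3|^i = (1/2)^i$.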
From earlier analysis, we know that the convergence rate depends on the 
ratios of eigenvalues. To speed convergence, we use shifting and inverting. 


ALGORITHM 4.5. QR iteration with a shift: Given $A_0$, we iterate:

    i = 0
    repeat
        Choose a shift $\sigma_i$ near an eigenvalue of $A$
        Factor $A_i - \sigma_i I = Q_i R_i$ (QR decomposition)
        $A_{i+1} = R_i Q_i + \sigma_i I$
        i = i + 1
    until convergence


LEMMA 4.4. $A_i$ and $A_{i+1}$ are orthogonally similar.


Proof. $A_{i+1} = R_i Q_i + \sigma_i I = Q_i^T Q_i R_i Q_i + \sigma_i Q_i^T Q_i = Q_i^T (Q_i R_i + \sigma_i I) Q_i = Q_i^T A_i Q_i$.


If $R_i$ is nonsingular, we may also write

$$A_{i+1} = R_i Q_i + \sigma_i I = R_i Q_i R_i R_i^{-1} + \sigma_i R_i R_i^{-1} = R_i (Q_i R_i + \sigma_i I) R_i^{-1} = R_i A_i R_i^{-1}.$$


If $\sigma_i$ is an exact eigenvalue of $A_i$, then we claim that QR iteration converges
in one step: since $\sigma_i$ is an eigenvalue, $A_i - \sigma_i I$ is singular, so $R_i$ is singular, and
so some diagonal entry of $R_i$ must be zero. Suppose $R_i(n,n) = 0$. This implies
that the last row of $R_i Q_i$ is 0, so the last row of $A_{i+1} = R_i Q_i + \sigma_i I$ equals $\sigma_i e_n^T$,
where $e_n$ is the $n$th column of the $n$-by-$n$ identity matrix. In other words, the
last row of $A_{i+1}$ is zero except for the eigenvalue $\sigma_i$ appearing in the $(n,n)$
entry. This means that the algorithm has converged, because $A_{i+1}$ is block
upper triangular, with a trailing 1-by-1 block $\sigma_i$; the leading $(n-1)$-by-$(n-1)$
block $A'$ is a new, smaller eigenproblem to which QR iteration can be applied
without ever modifying $\sigma_i$ again:

$$A_{i+1} = \begin{bmatrix} A' & a \\ 0 & \sigma_i \end{bmatrix}.$$

When $\sigma_i$ is not an exact eigenvalue, we will accept $A_{i+1}(n,n)$ as having
converged when the lower left block $A_{i+1}(n, 1:n-1)$ is small enough. Recall
from our earlier analysis that we expect $A_{i+1}(n, 1:n-1)$ to shrink by a factor
$|\lambda_k - \sigma_i| / \min_{j \neq k} |\lambda_j - \sigma_i|$, where $|\lambda_k - \sigma_i| = \min_j |\lambda_j - \sigma_i|$. So if $\sigma_i$ is a very
good approximation to eigenvalue $\lambda_k$, we expect fast convergence.

Here is another way to see why convergence should be fast, by recognizing
that QR iteration is implicitly doing inverse iteration. When $\sigma_i$ is an exact
eigenvalue, the last column $q_i$ of $Q_i$ will be a left eigenvector of $A_i$ for eigenvalue
$\sigma_i$, since $q_i^* A_i = q_i^* (Q_i R_i + \sigma_i I) = e_n^T R_i + \sigma_i q_i^* = \sigma_i q_i^*$. When $\sigma_i$ is close to an
eigenvalue, we expect $q_i$ to be close to an eigenvector for the following reason:
$q_i$ is parallel to $((A_i - \sigma_i I)^*)^{-1} e_n$ (we explain why below). In other words $q_i$
is the same as would be obtained from inverse iteration on $(A_i - \sigma_i I)^*$ (and
so we expect it to be close to a left eigenvector).

Here is the proof that $q_i$ is parallel to $((A_i - \sigma_i I)^*)^{-1} e_n$. $A_i - \sigma_i I = Q_i R_i$
implies $(A_i - \sigma_i I) R_i^{-1} = Q_i$. Inverting and taking the conjugate transpose of
both sides leave the right-hand side $Q_i$ unchanged and change the left-hand side
to $((A_i - \sigma_i I)^*)^{-1} R_i^*$, whose last column is $((A_i - \sigma_i I)^*)^{-1} \cdot [0, \ldots, 0, \bar{R}_{i,nn}]^T$,
which is proportional to the last column of $((A_i - \sigma_i I)^*)^{-1}$.

How do we choose $\sigma_i$ to be an accurate approximate eigenvalue, when we
are trying to compute eigenvalues in the first place? We will say more about
this later, but for now note that near convergence to a real eigenvalue $A_i(n,n)$
is close to that eigenvalue, so $\sigma_i = A_i(n,n)$ is a good choice of shift. In fact,
it yields local quadratic convergence, which means that the number of correct
digits doubles at every step. We explain why quadratic convergence occurs as
follows: Suppose at step $i$ that $\|A_i(n, 1:n-1)\| / \|A\| \equiv \eta \ll 1$. If we were
to set $A_i(n, 1:n-1)$ to exactly 0, we would make $A_i$ block upper triangular
and so perturb a true eigenvalue $\lambda_k$ to make it equal to $A_i(n,n)$. If this
eigenvalue is far from the other eigenvalues, it will not be ill-conditioned, so
this perturbation will be $O(\eta\|A\|)$. In other words, $|\lambda_k - A_i(n,n)| = O(\eta\|A\|)$. On
the next iteration, if we choose $\sigma_i = A_i(n,n)$, we expect $A_{i+1}(n, 1:n-1)$ to
shrink by a factor $|\lambda_k - \sigma_i| / \min_{j \neq k} |\lambda_j - \sigma_i| = O(\eta)$, implying that
$\|A_{i+1}(n, 1:n-1)\| = O(\eta^2 \|A\|)$, or $\|A_{i+1}(n, 1:n-1)\| / \|A\| = O(\eta^2)$. Decreasing the
error this way from $\eta$ to $O(\eta^2)$ is quadratic convergence.
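
The squaring of $\eta$ is easy to observe. Here is a small Python/NumPy sketch of Algorithm 4.5 with the shift $\sigma_i = A_i(n,n)$. The test matrix is an assumption made for illustration: it is upper triangular apart from a last row of size $10^{-2}$, so the iteration starts in the regime $\eta \ll 1$ where the argument above applies.

```python
import numpy as np

def shifted_qr_step(A):
    """One step of QR iteration with the shift sigma = A(n, n)."""
    n = A.shape[0]
    sigma = A[-1, -1]
    Q, R = np.linalg.qr(A - sigma * np.eye(n))
    return R @ Q + sigma * np.eye(n)

# Illustrative matrix: nearly block upper triangular already.
A = np.array([[30.0, 3.0, -2.0, 1.0],
              [0.0, 6.0, 2.0, -1.0],
              [0.0, 0.0, 2.0, 1.0],
              [1e-2, -1e-2, 1e-2, 1.0]])

eta = []          # eta[i] = ||A_{i+1}(n, 1:n-1)|| / ||A||
Ai = A.copy()
for _ in range(5):
    Ai = shifted_qr_step(Ai)
    eta.append(np.linalg.norm(Ai[-1, :-1]) / np.linalg.norm(A))

for step, value in enumerate(eta, start=1):
    print(step, value)
print("converged eigenvalue:", Ai[-1, -1])
```

The printed values of `eta` should roughly square from one step to the next until they reach roundoff level.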


EXAMPLE 4.7. Here are some shifted QR iterations starting with the same
4-by-4 matrix $A_0$ as in Example 4.5, with shift $\sigma_i = A_i(4,4)$. The convergence
is a bit erratic at first but eventually becomes quadratic near the end, with
$\|A_i(4, 1:3)\| \approx |A_i(4,3)|$ approximately squaring at each of the last three
steps. Also, the number of correct digits in $A_i(4,4)$ doubles at the fourth
through second-to-last steps.


Ao(4,:)= +19 +56. +40. —11.558 

Ai(4,:)=  —.85 —4.9 +2.2- 107? —6.6068 

A2(4,:) =  +.35 +.86 +.30 0.74894 

A3(4,:)=  =—1.2.107? —.17 —.70 1.4672 

A4(4,:) =  —-1.5-107* -1.8-107? —.38 1.4045 

As(4,:)=  —3.0-10°° —2.2.107° —.50 1.1403 

Ae(4,:)=  —14-107°8 -6.3-107> —7.8-107? 1.0272 

A7(4,:)= 14-107" -3.6-1077  -2.3-107% 0.99941 

As(4,:)= +2.8-107'® +44.2-107'' 41.4-107® — 0.9999996468853453 
Ag(4,:)= —3.4-107°4 —3.0-107'8 —4.8-107'% 0.9999999999998767 
Aio(4,:)= +1.5-1078% +7.4-107%? +6.0-10776 — 1.000000000000001 

By the time we reach $A_{10}$, the rest of the matrix has made a lot of progress toward
convergence as well, so later eigenvalues will be computed very quickly, in one or two steps
each:


30.000 —32.557 —70.844 14.985 
viene | 6.1548 - 107° 6.0000 1.8143 —.55754 | 
2.5531 - 1071? 2.0120-107° 2.0000 —.25894 |` 
| 1.4692 - 1078 7.4289 - 107°? 6.0040 - 107? 1.0000 | 


4.4.5. Making QR Iteration Practical 


Here are some remaining problems we have to solve to make the algorithm 
more practical: 


1. The iteration is too expensive. The QR decomposition costs $O(n^3)$ flops,
so if we were lucky enough to do only one iteration per eigenvalue, the
cost would be $O(n^4)$. But we seek an algorithm with a total cost of only
$O(n^3)$.


2. How shall we choose $\sigma_i$ to accelerate convergence to a complex eigenvalue?
Choosing $\sigma_i$ complex means all arithmetic has to be complex,
increasing the cost by a factor of about 4 when $A$ is real. We seek an
algorithm that uses all real arithmetic if $A$ is real and converges to real
Schur form.


3. How do we recognize convergence? 


The solutions to these problems, which we will describe in more detail later, 
are as follows: 


1. We will initially reduce the matrix to upper Hessenberg form; this means
that $A$ is zero below the first subdiagonal (i.e., $a_{ij} = 0$ if $i > j + 1$) (see
section 4.4.6). Then we will apply a step of QR iteration implicitly, i.e.,
without computing $Q$ or multiplying by it explicitly (see section 4.4.8).
This will reduce the cost of one QR iteration from $O(n^3)$ to $O(n^2)$ and
the overall cost from $O(n^4)$ to $O(n^3)$ as desired.


When $A$ is symmetric we will reduce it to tridiagonal form instead, reducing
the cost of a single QR iteration further to $O(n)$. This is discussed
in section 4.4.7 and Chapter 5.


2. Since complex eigenvalues of real matrices occur in complex conjugate
pairs, we can shift by $\sigma_i$ and $\bar{\sigma}_i$ simultaneously; it turns out that this
will permit us to maintain real arithmetic (see section 4.4.8). If $A$ is
symmetric, all eigenvalues are real, and this is not an issue.


3. Convergence occurs when subdiagonal entries of $A_i$ are “small enough.”
To help choose a practical threshold, we use the notion of backward stability:
Since $A_i$ is related to $A$ by a similarity transformation by an orthogonal
matrix, we expect $A_i$ to have roundoff errors of size $O(\varepsilon)\|A\|$ in
it anyway. Therefore, any subdiagonal entry of $A_i$ smaller than $O(\varepsilon)\|A\|$
in magnitude may as well be zero, so we set it to zero.$^{10}$ When $A$ is
upper Hessenberg, setting $a_{p+1,p}$ to zero will make $A$ into a block upper
triangular matrix

$$A = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix},$$

where $A_{11}$ is $p$-by-$p$ and $A_{11}$ and $A_{22}$ are both Hessenberg. Then the
eigenvalues of $A_{11}$ and $A_{22}$ may be found independently to get the eigenvalues
of $A$. When all these diagonal blocks are 1-by-1 or 2-by-2, the algorithm has finished.


4.4.6. Hessenberg Reduction 


Given a real matrix $A$, we seek an orthogonal $Q$ so that $QAQ^T$ is upper
Hessenberg. The algorithm is a simple variation on the idea used for the QR
decomposition.


EXAMPLE 4.8. We illustrate the general pattern of Hessenberg reduction with
a 5-by-5 example. Each $Q_i$ below is a 5-by-5 Householder reflection, chosen to
zero out entries $i + 2$ through $n$ in column $i$ and leaving entries 1 through $i$
unchanged.


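1. Choose $Q_1$ so

$$Q_1 A = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & x & x & x & x \\ o & x & x & x & x \end{bmatrix} \quad \text{and} \quad A_1 \equiv Q_1 A Q_1^T = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & x & x & x & x \\ o & x & x & x & x \end{bmatrix}.$$

$Q_1$ leaves the first row of $Q_1 A$ unchanged, and $Q_1^T$ leaves the first column
of $Q_1 A Q_1^T$ unchanged, including the zeros.


2. Choose $Q_2$ so

$$Q_2 A_1 = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & o & x & x & x \\ o & o & x & x & x \end{bmatrix} \quad \text{and} \quad A_2 \equiv Q_2 A_1 Q_2^T = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & o & x & x & x \\ o & o & x & x & x \end{bmatrix}.$$

$Q_2$ changes only the last three rows of $A_1$, and $Q_2^T$ leaves the first two
columns of $Q_2 A_1 Q_2^T$ unchanged, including the zeros.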


$^{10}$In practice, we use a slightly more stringent condition, replacing $\|A\|$ with the norm of
a submatrix of $A$, to take into account matrices which may be “graded” with large entries
in one place and small entries elsewhere. We can also set a subdiagonal entry to zero when
the product $a_{p+1,p}\,a_{p+2,p+1}$ of two adjacent subdiagonal entries is small enough. See the
LAPACK routine slahqr for details.
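
3. Choose $Q_3$ so

$$Q_3 A_2 = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & o & x & x & x \\ o & o & o & x & x \end{bmatrix} \quad \text{and} \quad A_3 = Q_3 A_2 Q_3^T = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & o & x & x & x \\ o & o & o & x & x \end{bmatrix},$$

which is upper Hessenberg. Altogether $A_3 = (Q_3 Q_2 Q_1) \cdot A \cdot (Q_3 Q_2 Q_1)^T \equiv QAQ^T$.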


The general algorithm for Hessenberg reduction is as follows. 


ALGORITHM 4.6. Reduction to upper Hessenberg form:

    if Q is desired, set Q = I
    for i = 1 : n - 2
        $u_i$ = House(A(i+1 : n, i))
        $P_i = I - 2 u_i u_i^T$                 /* $Q_i = \mathrm{diag}(I^{i \times i}, P_i)$ */
        A(i+1 : n, i : n) = $P_i$ · A(i+1 : n, i : n)
        A(1 : n, i+1 : n) = A(1 : n, i+1 : n) · $P_i$
        if Q is desired
            Q(i+1 : n, i : n) = $P_i$ · Q(i+1 : n, i : n)      /* $Q = Q_i \cdot Q$ */
        end if
    end for

As with the QR decomposition, one does not form $P_i$ explicitly but instead
multiplies by $I - 2u_i u_i^T$ via matrix-vector operations. The $u_i$ vectors can also
be stored below the subdiagonal, similar to the QR decomposition. They can
be applied using Level 3 BLAS, as described in Question 3.17. This algorithm
is available as the Matlab command hess or the LAPACK routine sgehrd.

The number of floating point operations is easily counted to be $\frac{10}{3}n^3 + O(n^2)$,
or $\frac{14}{3}n^3 + O(n^2)$ if the product $Q = Q_{n-1} \cdots Q_1$ is computed as well.

The advantage of Hessenberg form under QR iteration is that it costs only
$6n^2 + O(n)$ flops instead of $O(n^3)$, and its form is preserved so that the matrix
remains upper Hessenberg.
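
For readers who want to experiment outside Matlab, here is a hedged Python/NumPy sketch of Algorithm 4.6. The helper names `house` and `hessenberg` are ours, the sign convention inside `house` is one common choice, and for simplicity the reflections are applied to all columns of `Q`. The last lines also check numerically that one shifted QR step keeps the matrix upper Hessenberg.

```python
import numpy as np

def house(x):
    """Unit vector u such that (I - 2 u u^T) x is a multiple of e_1."""
    u = np.array(x, dtype=float)
    norm_x = np.linalg.norm(u)
    if norm_x == 0.0:
        return u                      # nothing to zero out; P = I
    u[0] += np.copysign(norm_x, u[0])
    return u / np.linalg.norm(u)

def hessenberg(A):
    """Return (H, Q) with H = Q A Q^T upper Hessenberg and Q orthogonal."""
    H = np.array(A, dtype=float)
    n = H.shape[0]
    Q = np.eye(n)
    for i in range(n - 2):            # 0-based column index
        u = house(H[i + 1:, i])
        # Left multiplication by P = I - 2 u u^T on rows i+1..n-1.
        H[i + 1:, i:] -= 2.0 * np.outer(u, u @ H[i + 1:, i:])
        # Right multiplication by P on columns i+1..n-1.
        H[:, i + 1:] -= 2.0 * np.outer(H[:, i + 1:] @ u, u)
        # Accumulate Q = Q_i Q (applied to all columns for simplicity).
        Q[i + 1:, :] -= 2.0 * np.outer(u, u @ Q[i + 1:, :])
    return H, Q

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
H, Q = hessenberg(A)

# One shifted QR step applied to H stays upper Hessenberg.
sigma = H[-1, -1]
Qh, Rh = np.linalg.qr(H - sigma * np.eye(6))
H1 = Rh @ Qh + sigma * np.eye(6)

print(np.max(np.abs(np.tril(H, -2))), np.max(np.abs(np.tril(H1, -2))))
```

Entries below the first subdiagonal come out at roundoff level rather than exactly zero; a production code would set them to zero explicitly.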


PROPOSITION 4.5. Hessenberg form is preserved by QR iteration.


Proof. It is easy to confirm that the QR decomposition of an upper Hessenberg
matrix like $A_i - \sigma I$ yields an upper Hessenberg $Q$ (since the $j$th column of $Q$
is a linear combination of the leading $j$ columns of $A_i - \sigma I$). Then it is easy
to confirm that $RQ$ remains upper Hessenberg and adding $\sigma I$ does not change
this.


DEFINITION 4.5. A Hessenberg matrix $H$ is unreduced if all subdiagonal entries are
nonzero.


It is easy to see that if $H$ is reduced because $h_{i+1,i} = 0$, then its eigenvalues
are those of its leading $i$-by-$i$ Hessenberg submatrix and its trailing $(n-i)$-by-$(n-i)$
Hessenberg submatrix, so we need consider only unreduced matrices.


4.4.7. Tridiagonal and Bidiagonal Reduction 


If $A$ is symmetric, the Hessenberg reduction process leaves $A$ symmetric at
each step, so zeros are created in symmetric positions. This means we need
work on only half the matrix, reducing the operation count to $\frac{4}{3}n^3 + O(n^2)$ or
$\frac{8}{3}n^3 + O(n^2)$ to form $Q_{n-1} \cdots Q_1$ as well. We call this algorithm tridiagonal
reduction. We will use this algorithm in Chapter 5. This routine is available
as LAPACK routine ssytrd.

Looking ahead a bit to our discussion of computing the SVD in section 5.4,
we recall from section 3.2.3 that the eigenvalues of the symmetric matrix $A^T A$
are the squares of the singular values of $A$. Our eventual SVD algorithm will
use this fact, so we would like to find a form for $A$ which implies that $A^T A$ is
tridiagonal. We will choose $A$ to be upper bidiagonal, or nonzero only on the
diagonal and first superdiagonal. Thus, we want to compute orthogonal matrices
$Q$ and $V$ such that $QAV$ is bidiagonal. The algorithm, called bidiagonal
reduction, is very similar to Hessenberg and tridiagonal reduction.


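EXAMPLE 4.9. Here is a 4-by-4 example of bidiagonal reduction, which illustrates
the general pattern:


1. Choose $Q_1$ so

$$Q_1 A = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & x & x & x \\ o & x & x & x \end{bmatrix} \quad \text{and } V_1 \text{ so that} \quad A_1 \equiv Q_1 A V_1 = \begin{bmatrix} x & x & o & o \\ o & x & x & x \\ o & x & x & x \\ o & x & x & x \end{bmatrix}.$$

$Q_1$ is a Householder reflection, and $V_1$ is a Householder reflection that
leaves the first column of $Q_1 A$ unchanged.


2. Choose $Q_2$ so

$$Q_2 A_1 = \begin{bmatrix} x & x & o & o \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \end{bmatrix} \quad \text{and } V_2 \text{ so that} \quad A_2 \equiv Q_2 A_1 V_2 = \begin{bmatrix} x & x & o & o \\ o & x & x & o \\ o & o & x & x \\ o & o & x & x \end{bmatrix}.$$

$Q_2$ is a Householder reflection that leaves the first row of $A_1$ unchanged.
$V_2$ is a Householder reflection that leaves the first two columns of $Q_2 A_1$
unchanged.


3. Choose $Q_3$ so

$$Q_3 A_2 = \begin{bmatrix} x & x & o & o \\ o & x & x & o \\ o & o & x & x \\ o & o & o & x \end{bmatrix} \quad \text{and } V_3 = I \text{ so that } A_3 = Q_3 A_2.$$

$Q_3$ is a Householder reflection that leaves the first two rows of $A_2$ unchanged.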


In general, if $A$ is $n$-by-$n$ then we get orthogonal matrices $Q = Q_{n-1} \cdots Q_1$
and $V = V_1 \cdots V_{n-2}$ such that $QAV = A'$ is upper bidiagonal.

Note that $A'^T A' = V^T A^T Q^T Q A V = V^T A^T A V$, so $A'^T A'$ has the same
eigenvalues as $A^T A$; i.e., $A'$ has the same singular values as $A$.

The cost of this bidiagonal reduction is $\frac{8}{3}n^3 + O(n^2)$ flops, plus another
$4n^3 + O(n^2)$ flops to compute $Q$ and $V$. This routine is available as LAPACK
routine sgebrd.
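
The pattern of Example 4.9 can be sketched in a few lines of Python/NumPy. This is an illustration under our own naming (`house`, `bidiagonalize`), not the LAPACK routine, and it does not accumulate $Q$ and $V$. It only checks that the result is upper bidiagonal and has the same singular values as $A$.

```python
import numpy as np

def house(x):
    """Unit vector u such that (I - 2 u u^T) x is a multiple of e_1."""
    u = np.array(x, dtype=float)
    norm_x = np.linalg.norm(u)
    if norm_x == 0.0:
        return u
    u[0] += np.copysign(norm_x, u[0])
    return u / np.linalg.norm(u)

def bidiagonalize(A):
    """Reduce a square matrix to upper bidiagonal form by Householder reflections."""
    B = np.array(A, dtype=float)
    n = B.shape[0]
    for i in range(n - 1):
        # Left reflection Q_{i+1}: zero out column i below the diagonal.
        u = house(B[i:, i])
        B[i:, i:] -= 2.0 * np.outer(u, u @ B[i:, i:])
        if i < n - 2:
            # Right reflection V_{i+1}: zero out row i beyond the superdiagonal.
            v = house(B[i, i + 1:])
            B[i:, i + 1:] -= 2.0 * np.outer(B[i:, i + 1:] @ v, v)
    return B

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
B = bidiagonalize(A)

sv_A = np.linalg.svd(A, compute_uv=False)
sv_B = np.linalg.svd(B, compute_uv=False)
print(np.max(np.abs(sv_A - sv_B)))
```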


4.4.8. QR Iteration with Implicit Shifts 


In this section we show how to implement QR iteration cheaply on an upper
Hessenberg matrix. The implementation will be implicit in the sense that we do
not explicitly compute the QR factorization of a matrix $H$ but rather construct
$Q$ implicitly as a product of Givens rotations and other simple orthogonal
matrices. The implicit Q theorem described below shows that this implicitly
constructed $Q$ is the $Q$ we want. Then we show how to incorporate a single shift
$\sigma$, which is necessary to accelerate convergence. To retain real arithmetic in
the presence of complex eigenvalues, we then show how to do a double shift, i.e.,
combine two consecutive QR iterations with complex conjugate shifts $\sigma$ and $\bar{\sigma}$;
the result after this double shift is again real. Finally, we discuss strategies for
choosing shifts $\sigma$ and $\bar{\sigma}$ to provide reliable quadratic convergence. However,
there have been recent discoveries of rare situations where convergence does not
occur [25, 64], so finding a completely reliable and fast implementation of QR
iteration remains an open problem.


Implicit Q Theorem 


Our eventual implementation of QR iteration will depend on the following 
theorem. 


THEOREM 4.9. Implicit Q theorem. Suppose that $Q^T A Q = H$ is unreduced
upper Hessenberg. Then columns 2 through $n$ of $Q$ are determined uniquely
(up to signs) by the first column of $Q$.


This theorem implies that to compute $A_{i+1} = Q_i^T A_i Q_i$ from $A_i$ in the QR
algorithm, we will need only to


1. compute the first column of $Q_i$ (which is parallel to the first column of
$A_i - \sigma_i I$ and so can be gotten just by normalizing this column vector).


2. choose other columns of $Q_i$ so $Q_i$ is orthogonal and $A_{i+1}$ is unreduced
Hessenberg.


Then by the implicit Q theorem, we know that we will have computed
$A_{i+1}$ correctly because $Q_i$ is unique up to signs, which do not matter. (Signs
do not matter because changing the signs of the columns of $Q_i$ is the same
as changing $A_i - \sigma_i I = Q_i R_i$ to $(Q_i S_i)(S_i R_i)$, where $S_i = \mathrm{diag}(\pm 1, \ldots, \pm 1)$.
Then $A_{i+1} = (S_i R_i)(Q_i S_i) + \sigma_i I = S_i (R_i Q_i + \sigma_i I) S_i$, which is an orthogonal
similarity that just changes the signs of the columns and rows of $A_{i+1}$.)

Proof of the implicit Q theorem. Suppose that $Q^T A Q = H$ and $V^T A V = G$ are
unreduced upper Hessenberg, $Q$ and $V$ are orthogonal, and the first columns
of $Q$ and $V$ are equal. Let $(X)_i$ denote the $i$th column of $X$. We wish to show
$(Q)_i = \pm (V)_i$ for all $i > 1$, or equivalently, that $W \equiv V^T Q = \mathrm{diag}(\pm 1, \ldots, \pm 1)$.

Since $W = V^T Q$, we get $GW = G V^T Q = V^T A Q = V^T Q H = WH$.
Now $GW = WH$ implies $G(W)_i = (GW)_i = (WH)_i = \sum_{j=1}^{i+1} h_{ji} (W)_j$, so
$h_{i+1,i} (W)_{i+1} = G(W)_i - \sum_{j=1}^{i} h_{ji} (W)_j$. Since $(W)_1 = [1, 0, \ldots, 0]^T$ and $G$ is
upper Hessenberg, we can use induction on $i$ to show that $(W)_i$ is nonzero in
entries 1 to $i$ only; i.e., $W$ is upper triangular. Since $W$ is also orthogonal, $W$
is diagonal $= \mathrm{diag}(\pm 1, \ldots, \pm 1)$.


Implicit Single Shift QR Algorithm 


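To see how to use the implicit Q theorem to compute $A_1$ from $A_0 = A$, we use
a 5-by-5 example.


EXAMPLE 4.10. 1. Choose

$$Q_1^T = \begin{bmatrix} c_1 & s_1 & & & \\ -s_1 & c_1 & & & \\ & & 1 & & \\ & & & 1 & \\ & & & & 1 \end{bmatrix} \quad \text{so} \quad A_1 \equiv Q_1^T A Q_1 = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ + & x & x & x & x \\ o & o & x & x & x \\ o & o & o & x & x \end{bmatrix}.$$

We discuss how to choose $c_1$ and $s_1$ below; for now they may be any
Givens rotation. The ‘+’ in position (3,1) is called a bulge and needs to
be gotten rid of to restore Hessenberg form.


2. Choose

$$Q_2^T = \begin{bmatrix} 1 & & & & \\ & c_2 & s_2 & & \\ & -s_2 & c_2 & & \\ & & & 1 & \\ & & & & 1 \end{bmatrix} \quad \text{so} \quad Q_2^T A_1 = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & o & x & x & x \\ o & o & o & x & x \end{bmatrix}$$

and

$$A_2 \equiv Q_2^T A_1 Q_2 = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & + & x & x & x \\ o & o & o & x & x \end{bmatrix}.$$

Thus the bulge has been “chased” from (3,1) to (4,2).


3. Choose

$$Q_3^T = \begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & c_3 & s_3 & \\ & & -s_3 & c_3 & \\ & & & & 1 \end{bmatrix} \quad \text{so} \quad Q_3^T A_2 = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & o & x & x & x \\ o & o & o & x & x \end{bmatrix}$$

and

$$A_3 \equiv Q_3^T A_2 Q_3 = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & o & x & x & x \\ o & o & + & x & x \end{bmatrix}.$$

The bulge has been chased from (4,2) to (5,3).


4. Choose

$$Q_4^T = \begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & c_4 & s_4 \\ & & & -s_4 & c_4 \end{bmatrix} \quad \text{so} \quad Q_4^T A_3 = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & o & x & x & x \\ o & o & o & x & x \end{bmatrix}$$

and

$$A_4 \equiv Q_4^T A_3 Q_4 = \begin{bmatrix} x & x & x & x & x \\ x & x & x & x & x \\ o & x & x & x & x \\ o & o & x & x & x \\ o & o & o & x & x \end{bmatrix},$$

so we are back to upper Hessenberg form.
Altogether $Q^T A Q$ is upper Hessenberg, where

$$Q = Q_1 Q_2 Q_3 Q_4 = \begin{bmatrix} c_1 & x & x & x & x \\ s_1 & x & x & x & x \\ & s_2 & x & x & x \\ & & s_3 & x & x \\ & & & s_4 & x \end{bmatrix},$$

so the first column of $Q$ is $[c_1, s_1, 0, \ldots, 0]^T$, which by the implicit Q theorem
has uniquely determined the other columns of $Q$ (up to signs). We now choose
the first column of $Q$ to be proportional to the first column of $A - \sigma I$,
$[a_{11} - \sigma, a_{21}, 0, \ldots, 0]^T$. This means $Q$ is the same as in the QR decomposition of
$A - \sigma I$, as desired.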


The cost for an $n$-by-$n$ matrix is $6n^2 + O(n)$.
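
The bulge-chasing pattern of Example 4.10 translates almost line by line into code. The following Python/NumPy sketch is an illustration under our own naming (`implicit_single_shift_step`), with a small integer test matrix and a shift $\sigma = 0.5$ that are assumptions chosen so that $\sigma$ is not an eigenvalue. It compares the implicit step with the explicit step $RQ + \sigma I$; by the implicit Q theorem the two agree up to signs, so absolute values are compared.

```python
import numpy as np

def implicit_single_shift_step(H, sigma):
    """One implicit QR step on an upper Hessenberg H with a real shift sigma."""
    H = np.array(H, dtype=float)
    n = H.shape[0]
    # The first rotation is determined by the first column of H - sigma*I.
    x, y = H[0, 0] - sigma, H[1, 0]
    for k in range(n - 1):
        r = np.hypot(x, y)
        c, s = (1.0, 0.0) if r == 0.0 else (x / r, y / r)
        G = np.array([[c, s], [-s, c]])          # acts on rows/columns k, k+1
        H[k:k + 2, :] = G @ H[k:k + 2, :]        # multiply by Q_k^T on the left
        H[:, k:k + 2] = H[:, k:k + 2] @ G.T      # multiply by Q_k on the right
        if k < n - 2:
            # The bulge now sits at (k+2, k); the next rotation removes it.
            x, y = H[k + 1, k], H[k + 2, k]
    return H

H0 = np.array([[4.0, 1.0, 2.0, 0.0, 1.0],
               [3.0, 2.0, 1.0, 1.0, 0.0],
               [0.0, 2.0, 5.0, 1.0, 2.0],
               [0.0, 0.0, 1.0, 3.0, 1.0],
               [0.0, 0.0, 0.0, 2.0, 1.0]])
sigma = 0.5

H_implicit = implicit_single_shift_step(H0, sigma)

Q, R = np.linalg.qr(H0 - sigma * np.eye(5))
H_explicit = R @ Q + sigma * np.eye(5)

print(np.max(np.abs(np.abs(H_implicit) - np.abs(H_explicit))))
```

Each rotation touches two rows and two columns, which is where the $O(n^2)$ cost of one step comes from; the dense 2-by-2 products above are written for clarity rather than for minimal flop count.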


Implicit Double Shift QR Algorithm 


This section describes how to maintain real arithmetic by shifting by $\sigma$ and $\bar{\sigma}$
at the same time. This is essential for an efficient practical implementation but
not for a mathematical understanding of the algorithm and may be skipped
on a first reading.

The results of shifting by $\sigma$ and $\bar{\sigma}$ in succession are

$$\begin{aligned}
A_0 - \sigma I &= Q_1 R_1, \\
A_1 &= R_1 Q_1 + \sigma I \quad \text{so } A_1 = Q_1^T A_0 Q_1, \\
A_1 - \bar{\sigma} I &= Q_2 R_2, \\
A_2 &= R_2 Q_2 + \bar{\sigma} I \quad \text{so } A_2 = Q_2^T A_1 Q_2 = Q_2^T Q_1^T A_0 Q_1 Q_2.
\end{aligned}$$

LEMMA 4.5. We can choose $Q_1$ and $Q_2$ so

(1) $Q_1 Q_2$ is real,
(2) $A_2$ is therefore real,
(3) the first column of $Q_1 Q_2$ is easy to compute.


Proof. Since $Q_2 R_2 = A_1 - \bar{\sigma} I = R_1 Q_1 + (\sigma - \bar{\sigma}) I$, we get

$$\begin{aligned}
Q_1 Q_2 R_2 R_1 &= Q_1 (R_1 Q_1 + (\sigma - \bar{\sigma}) I) R_1 \\
&= Q_1 R_1 Q_1 R_1 + (\sigma - \bar{\sigma}) Q_1 R_1 \\
&= (A_0 - \sigma I)(A_0 - \sigma I) + (\sigma - \bar{\sigma})(A_0 - \sigma I) \\
&= A_0^2 - 2(\Re\sigma) A_0 + |\sigma|^2 I \equiv M.
\end{aligned}$$

Thus $(Q_1 Q_2)(R_2 R_1)$ is the QRD of the real matrix $M$ and so $Q_1 Q_2$, as well
as $R_2 R_1$, can be chosen real. This means that $A_2 = (Q_1 Q_2)^T A (Q_1 Q_2)$ also is
real.

The first column of $Q_1 Q_2$ is proportional to the first column of $A_0^2 - 2(\Re\sigma) A_0 + |\sigma|^2 I$,
which is

$$\begin{bmatrix} a_{11}^2 + a_{12} a_{21} - 2(\Re\sigma) a_{11} + |\sigma|^2 \\ a_{21}(a_{11} + a_{22} - 2(\Re\sigma)) \\ a_{21} a_{32} \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

The rest of the columns of $Q_1 Q_2$ are computed implicitly using the implicit
Q theorem. The process is still called “bulge chasing,” but now the bulge is
2-by-2 instead of 1-by-1.
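
The identity $Q_1 Q_2 R_2 R_1 = M$ and the formula for the first column of $M$ can be confirmed numerically. The sketch below uses Python/NumPy with an illustrative Hessenberg matrix and an arbitrary complex shift (both assumptions). Note that the library QR routine need not return the particular $Q_1$, $Q_2$ for which $Q_1 Q_2$ is real; only the product $(Q_1 Q_2)(R_2 R_1)$ is checked.

```python
import numpy as np

H = np.array([[4.0, 1.0, 2.0, 0.0, 1.0],
              [3.0, 2.0, 1.0, 1.0, 0.0],
              [0.0, 2.0, 5.0, 1.0, 2.0],
              [0.0, 0.0, 1.0, 3.0, 1.0],
              [0.0, 0.0, 0.0, 2.0, 1.0]])
n = H.shape[0]
I = np.eye(n)
sigma = 1.0 + 2.0j                      # arbitrary complex shift

# Two explicit shifted QR steps, with shifts sigma and conj(sigma).
Q1, R1 = np.linalg.qr(H - sigma * I)
A1 = R1 @ Q1 + sigma * I
Q2, R2 = np.linalg.qr(A1 - np.conj(sigma) * I)

# The real matrix M = H^2 - 2 Re(sigma) H + |sigma|^2 I.
M = H @ H - 2.0 * sigma.real * H + abs(sigma) ** 2 * I
product = (Q1 @ Q2) @ (R2 @ R1)

# First column of M from the closed-form expression; only three nonzeros.
first_col = np.zeros(n)
first_col[0] = H[0, 0] ** 2 + H[0, 1] * H[1, 0] - 2.0 * sigma.real * H[0, 0] + abs(sigma) ** 2
first_col[1] = H[1, 0] * (H[0, 0] + H[1, 1] - 2.0 * sigma.real)
first_col[2] = H[1, 0] * H[2, 1]

print(np.max(np.abs(product - M)), np.max(np.abs(M[:, 0] - first_col)))
```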


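EXAMPLE 4.11. Here is a 6-by-6 example of bulge chasing.


1. Choose $Q_1^T = \begin{bmatrix} \tilde{Q}_1^T & 0 \\ 0 & I \end{bmatrix}$, where the first column of $\tilde{Q}_1$ is given as above,
so

$$Q_1^T A = \begin{bmatrix} x & x & x & x & x & x \\ x & x & x & x & x & x \\ + & x & x & x & x & x \\ o & o & x & x & x & x \\ o & o & o & x & x & x \\ o & o & o & o & x & x \end{bmatrix} \quad \text{and} \quad A_1 \equiv Q_1^T A Q_1 = \begin{bmatrix} x & x & x & x & x & x \\ x & x & x & x & x & x \\ + & x & x & x & x & x \\ + & + & x & x & x & x \\ o & o & o & x & x & x \\ o & o & o & o & x & x \end{bmatrix}.$$

We see that there is a 2-by-2 bulge, indicated by plus signs.


2. Choose a Householder reflection $Q_2^T$, which affects only rows 2, 3, and 4
of $Q_2^T A_1$, zeroing out entries (3,1) and (4,1) of $A_1$ (this means that $Q_2^T$
is the identity matrix outside rows and columns 2 through 4):

$$Q_2^T A_1 = \begin{bmatrix} x & x & x & x & x & x \\ x & x & x & x & x & x \\ o & x & x & x & x & x \\ o & + & x & x & x & x \\ o & o & o & x & x & x \\ o & o & o & o & x & x \end{bmatrix} \quad \text{and} \quad A_2 \equiv Q_2^T A_1 Q_2 = \begin{bmatrix} x & x & x & x & x & x \\ x & x & x & x & x & x \\ o & x & x & x & x & x \\ o & + & x & x & x & x \\ o & + & + & x & x & x \\ o & o & o & o & x & x \end{bmatrix},$$

and the 2-by-2 bulge has been “chased” one column.


3. Choose a Householder reflection $Q_3^T$, which affects only rows 3, 4, and 5
of $Q_3^T A_2$, zeroing out entries (4,2) and (5,2) of $A_2$ (this means that $Q_3^T$
is the identity outside rows and columns 3 through 5):

$$Q_3^T A_2 = \begin{bmatrix} x & x & x & x & x & x \\ x & x & x & x & x & x \\ o & x & x & x & x & x \\ o & o & x & x & x & x \\ o & o & + & x & x & x \\ o & o & o & o & x & x \end{bmatrix} \quad \text{and} \quad A_3 \equiv Q_3^T A_2 Q_3 = \begin{bmatrix} x & x & x & x & x & x \\ x & x & x & x & x & x \\ o & x & x & x & x & x \\ o & o & x & x & x & x \\ o & o & + & x & x & x \\ o & o & + & + & x & x \end{bmatrix}.$$


4. Choose a Householder reflection $Q_4^T$, which affects only rows 4, 5, and 6
of $Q_4^T A_3$, zeroing out entries (5,3) and (6,3) of $A_3$ (this means that $Q_4^T$
is the identity matrix outside rows and columns 4 through 6):

$$A_4 \equiv Q_4^T A_3 Q_4 = \begin{bmatrix} x & x & x & x & x & x \\ x & x & x & x & x & x \\ o & x & x & x & x & x \\ o & o & x & x & x & x \\ o & o & o & x & x & x \\ o & o & o & + & x & x \end{bmatrix}.$$


5. Choose

$$Q_5^T = \begin{bmatrix} 1 & & & & & \\ & 1 & & & & \\ & & 1 & & & \\ & & & 1 & & \\ & & & & c & s \\ & & & & -s & c \end{bmatrix} \quad \text{so} \quad A_5 = Q_5^T A_4 Q_5 = \begin{bmatrix} x & x & x & x & x & x \\ x & x & x & x & x & x \\ o & x & x & x & x & x \\ o & o & x & x & x & x \\ o & o & o & x & x & x \\ o & o & o & o & x & x \end{bmatrix}.$$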

Choosing a Shift for the QR Algorithm 


To completely specify one iteration of either single shift or double shift Hessenberg
QR iteration, we need to choose the shift $\sigma$ (and $\bar{\sigma}$). Recall from the
end of section 4.4.4 that a reasonable choice of single shift, one that resulted
in asymptotic quadratic convergence to a real eigenvalue, was $\sigma = a_{n,n}$, the
bottom right entry of $A_i$. The generalization for double shifting is to use the
Francis shift, which means that $\sigma$ and $\bar{\sigma}$ are the eigenvalues of the bottom 2-by-2
corner of $A_i$:

$$\begin{bmatrix} a_{n-1,n-1} & a_{n-1,n} \\ a_{n,n-1} & a_{n,n} \end{bmatrix}.$$

This will let us converge to either two real
eigenvalues in the bottom 2-by-2 corner or a single 2-by-2 block with complex
conjugate eigenvalues. When we are close to convergence, we expect $a_{n-1,n-2}$
(and possibly $a_{n,n-1}$) to be small so that the eigenvalues of this 2-by-2 matrix
are good approximations for eigenvalues of $A$. Indeed, one can show that this
choice leads to quadratic convergence asymptotically. This means that once
$a_{n-1,n-2}$ (and possibly $a_{n,n-1}$) is small enough, its magnitude will square at
each step and quickly approach zero. In practice, this works so well that on
average only two QR iterations per eigenvalue are needed for convergence for
almost all matrices.

In practice, the QR iteration with the Francis shift can fail to converge
(indeed, it leaves

$$\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$

unchanged). So the practical algorithm in use for decades had an “exceptional
shift” every 10 shifts if convergence had not occurred. Still, tiny sets of
matrices where that algorithm did not converge were discovered only recently
[25, 64]; matrices in a small neighborhood of

$$\begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & h & 0 \\ 0 & -h & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix},$$

where $h$ is a few thousand times machine epsilon, form such a set. So another
“exceptional shift” was recently added to the algorithm to patch this case.
But it is still an open problem to find a shift strategy that guarantees fast
convergence for all matrices.
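
The stagnation on the 3-by-3 cyclic permutation matrix can be reproduced directly. The Python/NumPy sketch below performs the double shift step explicitly, through the QR decomposition of $M = A^2 - 2(\Re\sigma)A + |\sigma|^2 I$ from Lemma 4.5, rather than by bulge chasing; this is a simplification for illustration, and the function name is ours. The trace and determinant of the trailing 2-by-2 block supply $2\Re\sigma$ and $|\sigma|^2$ (or the sum and product of the two shifts when they are real).

```python
import numpy as np

def francis_double_step(A):
    """One explicit double shift step with the Francis shift."""
    n = A.shape[0]
    B = A[-2:, -2:]
    trace = B[0, 0] + B[1, 1]                      # sum of the two shifts
    det = B[0, 0] * B[1, 1] - B[0, 1] * B[1, 0]    # product of the two shifts
    M = A @ A - trace * A + det * np.eye(n)
    Q, _ = np.linalg.qr(M)
    return Q.T @ A @ Q

P = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

P1 = francis_double_step(P)
print(np.abs(P1))
```

Up to the signs of rows and columns, `P1` equals `P`, so no subdiagonal entry shrinks and the iteration makes no progress.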


4.5. Other Nonsymmetric Eigenvalue Problems 


4.5.1. Regular Matrix Pencils and Weierstrass Canonical Form 


The standard eigenvalue problem asks for which scalars z the matrix A — zI 
is singular; these scalars are the eigenvalues. This notion generalizes in several 
important ways. 


DEFINITION 4.6. $A - \lambda B$, where $A$ and $B$ are $m$-by-$n$ matrices, is called a
matrix pencil, or just a pencil. Here $\lambda$ is an indeterminate, not a particular
numerical value.


DEFINITION 4.7. If $A$ and $B$ are square and $\det(A - \lambda B)$ is not identically
zero, the pencil $A - \lambda B$ is called regular. Otherwise it is called singular. When
$A - \lambda B$ is regular, $p(\lambda) \equiv \det(A - \lambda B)$ is called the characteristic polynomial
of $A - \lambda B$ and the eigenvalues of $A - \lambda B$ are defined to be


(1) the roots of $p(\lambda) = 0$,
(2) $\infty$ (with multiplicity $n - \deg(p)$) if $\deg(p) < n$.
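
EXAMPLE 4.12. Let

$$A - \lambda B = \begin{bmatrix} 1 & & \\ & 1 & \\ & & 0 \end{bmatrix} - \lambda \begin{bmatrix} 2 & & \\ & 0 & \\ & & 1 \end{bmatrix}.$$

Then $p(\lambda) = \det(A - \lambda B) = (1 - 2\lambda) \cdot (1 - 0\lambda) \cdot (0 - \lambda) = (2\lambda - 1)\lambda$, so the
eigenvalues are $\lambda = \frac{1}{2}$, $0$, and $\infty$.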
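
Definition 4.7 can be applied literally to this small example. The Python/NumPy sketch below recovers $p(\lambda)$ by interpolating $\det(A - \lambda B)$ at $n + 1$ points and then counts the missing degree as the multiplicity of $\infty$. The function name and the tolerance are assumptions, and the approach is meant only as an illustration for tiny matrices, since computing eigenvalues from the coefficients of a characteristic polynomial is not numerically reliable in general.

```python
import numpy as np

def pencil_eigenvalues(A, B, tol=1e-10):
    """Finite eigenvalues of a regular pencil A - lambda*B and the multiplicity of infinity."""
    n = A.shape[0]
    # p(lambda) = det(A - lambda*B) has degree at most n: interpolate at n+1 points.
    points = np.arange(n + 1, dtype=float)
    values = [np.linalg.det(A - t * B) for t in points]
    coeffs = np.polyfit(points, values, n)          # highest power first
    # Treat coefficients at roundoff level as exact zeros.
    coeffs[np.abs(coeffs) < tol * np.max(np.abs(coeffs))] = 0.0
    coeffs = np.trim_zeros(coeffs, 'f')             # drop vanishing leading terms
    degree = len(coeffs) - 1
    return np.sort(np.roots(coeffs)), n - degree

A = np.diag([1.0, 1.0, 0.0])
B = np.diag([2.0, 0.0, 1.0])

finite_eigs, n_infinite = pencil_eigenvalues(A, B)
print(finite_eigs, n_infinite)
```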


Matrix pencils arise naturally in many mathematical models of physical
systems; we give examples below. The next proposition relates the eigenvalues
of a regular pencil $A - \lambda B$ to the eigenvalues of a single matrix.


PROPOSITION 4.6. Let $A - \lambda B$ be regular. If $B$ is nonsingular, all eigenvalues
of $A - \lambda B$ are finite and the same as the eigenvalues of $AB^{-1}$ or $B^{-1}A$. If $B$
is singular, $A - \lambda B$ has eigenvalue $\infty$ with multiplicity $n - \mathrm{rank}(B)$. If $A$ is
nonsingular, the eigenvalues of $A - \lambda B$ are the same as the reciprocals of the
eigenvalues of $A^{-1}B$ or $BA^{-1}$, where a zero eigenvalue of $A^{-1}B$ corresponds
to an infinite eigenvalue of $A - \lambda B$.


Proof. If $B$ is nonsingular and $\lambda'$ is an eigenvalue, then $0 = \det(A - \lambda' B) =
\det(AB^{-1} - \lambda' I) = \det(B^{-1}A - \lambda' I)$, so $\lambda'$ is also an eigenvalue of $AB^{-1}$ and
$B^{-1}A$. If $B$ is singular, then take $p(\lambda) = \det(A - \lambda B)$, write the SVD of $B$ as
$B = U \Sigma V^T$, and substitute to get

$$p(\lambda) = \det(A - \lambda U \Sigma V^T) = \det(U(U^T A V - \lambda \Sigma)V^T) = \pm \det(U^T A V - \lambda \Sigma).$$

Since $\mathrm{rank}(B) = \mathrm{rank}(\Sigma)$, only $\mathrm{rank}(B)$ $\lambda$'s appear in $U^T A V - \lambda \Sigma$, so the
degree of the polynomial $\det(U^T A V - \lambda \Sigma)$ is $\mathrm{rank}(B)$.

If $A$ is nonsingular, $\det(A - \lambda B) = 0$ if and only if $\det(I - \lambda A^{-1} B) = 0$ or
$\det(I - \lambda B A^{-1}) = 0$. This can happen only if $\lambda \neq 0$ and $1/\lambda$ is an eigenvalue
of $A^{-1}B$ or $BA^{-1}$.


DEFINITION 4.8. Let $\lambda'$ be a finite eigenvalue of the regular pencil $A - \lambda B$.
Then $x \neq 0$ is a right eigenvector if $(A - \lambda' B)x = 0$, or equivalently $Ax = \lambda' Bx$.
If $\lambda' = \infty$ is an eigenvalue and $Bx = 0$, then $x$ is a right eigenvector. A left
eigenvector of $A - \lambda B$ is a right eigenvector of $(A - \lambda B)^*$.


EXAMPLE 4.13. Consider the pencil $A - \lambda B$ in Example 4.12. Since $A$ and $B$
are diagonal, the right and left eigenvectors are just the columns of the identity
matrix.


EXAMPLE 4.14. Consider the damped mass-spring system from Example 4.1.
There are two matrix pencils that arise naturally from this problem. First, we
can write the eigenvalue problem

$$Ax = \begin{bmatrix} -M^{-1}B & -M^{-1}K \\ I & 0 \end{bmatrix} x = \lambda x$$

as

$$\begin{bmatrix} -B & -K \\ I & 0 \end{bmatrix} x = \lambda \begin{bmatrix} M & 0 \\ 0 & I \end{bmatrix} x.$$

This may be a superior formulation if $M$ is very ill-conditioned, so that $M^{-1}B$
and $M^{-1}K$ are hard to compute accurately.

Second, it is common to consider the case $B = 0$ (no damping), so the
original differential equation is $M\ddot{x}(t) + Kx(t) = 0$. Seeking solutions of the
form $x_i(t) = e^{\lambda_i t} x_i(0)$, we get $\lambda_i^2 e^{\lambda_i t} M x_i(0) + e^{\lambda_i t} K x_i(0) = 0$, or
$\lambda_i^2 M x_i(0) + K x_i(0) = 0$. In other words, $-\lambda_i^2$ is an eigenvalue and $x_i(0)$ is a right
eigenvector of the pencil $K - \lambda M$. Since we are assuming that $M$ is nonsingular,
these are also the eigenvalue and right eigenvector of $M^{-1}K$.


Infinite eigenvalues also arise naturally in practice. For example, later 
in this section we will show how infinite eigenvalues correspond to impulse 
response in a system described by ordinary differential equations with linear 
constraints, or differential-algebraic equations [40]. See also Question 4.16, 
for an application of matrix pencils to computational geometry and computer 
graphics. 

Recall that all of our theory and algorithms for the eigenvalue problem of a
single matrix $A$ depended on finding a similarity transformation $S^{-1}AS$ of $A$
that is in “simpler” form than $A$. The next definition shows how to generalize
the notion of similarity to matrix pencils. Then we show how the Jordan form
and Schur form generalize to pencils.


DEFINITION 4.9. Let $P_L$ and $P_R$ be nonsingular matrices. Then pencils $A - \lambda B$
and $P_L A P_R - \lambda P_L B P_R$ are called equivalent.


PROPOSITION 4.7. The equivalent regular pencils $A - \lambda B$ and $P_L A P_R - \lambda P_L B P_R$
have the same eigenvalues. The vector $x$ is a right eigenvector of $A - \lambda B$ if
and only if $P_R^{-1} x$ is a right eigenvector of $P_L A P_R - \lambda P_L B P_R$. The vector $y$
is a left eigenvector of $A - \lambda B$ if and only if $(P_L^*)^{-1} y$ is a left eigenvector of
$P_L A P_R - \lambda P_L B P_R$.


Proof.
$\det(A - \lambda B) = 0$ if and only if $\det(P_L (A - \lambda B) P_R) = 0$.
$(A - \lambda B)x = 0$ if and only if $P_L (A - \lambda B) P_R P_R^{-1} x = 0$.
$(A - \lambda B)^* y = 0$ if and only if $P_R^* (A - \lambda B)^* P_L^* (P_L^*)^{-1} y = 0$.

The following theorem generalizes the Jordan canonical form to regular
matrix pencils.


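THEOREM 4.10. Weierstrass canonical form. Let $A - \lambda B$ be regular. Then
there are nonsingular $P_L$ and $P_R$ such that

$$P_L (A - \lambda B) P_R = \mathrm{diag}(J_{n_1}(\lambda_1) - \lambda I_{n_1}, \ldots, J_{n_k}(\lambda_k) - \lambda I_{n_k}, N_{m_1}, \ldots, N_{m_r}),$$

where $J_{n_i}(\lambda_i)$ is an $n_i$-by-$n_i$ Jordan block with eigenvalue $\lambda_i$,

$$J_{n_i}(\lambda_i) = \begin{bmatrix} \lambda_i & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{bmatrix},$$

and $N_{m_i}$ is a “Jordan block for $\lambda = \infty$ with multiplicity $m_i$,”

$$N_{m_i} = \begin{bmatrix} 1 & -\lambda & & \\ & 1 & \ddots & \\ & & \ddots & -\lambda \\ & & & 1 \end{bmatrix} = I_{m_i} - \lambda J_{m_i}(0).$$

For a proof, see [108].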


Application of Jordan and Weierstrass Forms to Differential Equa- 
tions 


Consider the linear differential equation $\dot{x}(t) = Ax(t) + f(t)$, $x(0) = x_0$. An
explicit solution is given by $x(t) = e^{At}x_0 + \int_0^t e^{A(t-\tau)} f(\tau)\,d\tau$. If we know
the Jordan form $A = SJS^{-1}$, we may change variables in the differential
equation to $y(t) = S^{-1}x(t)$ to get $\dot{y}(t) = Jy(t) + S^{-1}f(t)$, with solution
$y(t) = e^{Jt}y_0 + \int_0^t e^{J(t-\tau)} S^{-1} f(\tau)\,d\tau$. There is an explicit formula to compute $e^{Jt}$ or
any other function $f(J)$ of a matrix in Jordan form $J$. (We should not use this
formula numerically! For the basis of a better algorithm, see Question 4.4.)
Suppose that $f$ is given by its Taylor series $f(z) = \sum_{i=0}^{\infty} \frac{f^{(i)}(0) z^i}{i!}$ and $J$ is a
single Jordan block $J = \lambda I + N$, where $N$ has ones on the first superdiagonal
and zeros elsewhere. Then

\[
\begin{aligned}
f(J) &= \sum_{i=0}^{\infty} \frac{f^{(i)}(0)\,(\lambda I + N)^i}{i!} \\
&= \sum_{i=0}^{\infty} \frac{f^{(i)}(0)}{i!} \sum_{j=0}^{i} \binom{i}{j} \lambda^{i-j} N^j && \text{by the binomial theorem} \\
&= \sum_{j=0}^{\infty} \sum_{i=j}^{\infty} \frac{f^{(i)}(0)}{i!} \binom{i}{j} \lambda^{i-j} N^j && \text{reversing the order of summation} \\
&= \sum_{j=0}^{n-1} N^j \sum_{i=j}^{\infty} \frac{f^{(i)}(0)}{i!} \binom{i}{j} \lambda^{i-j},
\end{aligned}
\]


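where in the last equality we used the fact that $N^j = 0$ for $j > n - 1$. Note
that $N^j$ has ones on the $j$th superdiagonal and zeros elsewhere. Finally, note
that $\sum_{i=j}^{\infty} \frac{f^{(i)}(0)}{i!} \binom{i}{j} \lambda^{i-j}$ is the Taylor expansion for $f^{(j)}(\lambda)/j!$. Thus

\[
f(J) = \sum_{j=0}^{n-1} \frac{N^j f^{(j)}(\lambda)}{j!}
= \begin{bmatrix}
f(\lambda) & f'(\lambda) & \frac{f''(\lambda)}{2!} & \cdots & \frac{f^{(n-1)}(\lambda)}{(n-1)!} \\
 & f(\lambda) & f'(\lambda) & \ddots & \vdots \\
 & & \ddots & \ddots & \frac{f''(\lambda)}{2!} \\
 & & & \ddots & f'(\lambda) \\
 & & & & f(\lambda)
\end{bmatrix}, \qquad (4.6)
\]

so that $f(J)$ is upper triangular with $f^{(j)}(\lambda)/j!$ on the $j$th superdiagonal.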
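As an illustration only (the text warns against using this formula numerically), the following minimal sketch builds $f(J)$ for $f = \exp$ on a single Jordan block from the superdiagonal formula and compares it with a truncated Taylor series. The helper names and the choice $\lambda = 0.5$, $n = 4$ are assumptions made for this example.

```python
import math

def jordan_block_function(derivs_at_lam, n):
    """f(J) for an n-by-n Jordan block, given [f(lam), f'(lam), ..., f^(n-1)(lam)]."""
    F = [[0.0] * n for _ in range(n)]
    for j in range(n):                      # j-th superdiagonal holds f^(j)(lam)/j!
        c = derivs_at_lam[j] / math.factorial(j)
        for r in range(n - j):
            F[r][r + j] = c
    return F

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def exp_by_series(J, terms=60):
    """Truncated Taylor series sum_{i<terms} J^i/i!, used only as an independent check."""
    n = len(J)
    term = [[float(i == j) for j in range(n)] for i in range(n)]   # J^0/0!
    total = [row[:] for row in term]
    for i in range(1, terms):
        term = [[v / i for v in row] for row in matmul(term, J)]   # J^i/i!
        total = [[total[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return total

lam, n = 0.5, 4
J = [[lam if i == j else 1.0 if j == i + 1 else 0.0 for j in range(n)] for i in range(n)]
# For f = exp, every derivative at lam equals exp(lam).
F_formula = jordan_block_function([math.exp(lam)] * n, n)
F_series = exp_by_series(J)
max_diff = max(abs(F_formula[i][j] - F_series[i][j]) for i in range(n) for j in range(n))
```

The two matrices should agree to roundoff for this small, well-scaled block.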

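To solve the more general problem $B\dot{x} = Ax + f(t)$, $A - \lambda B$ regular, we use
the Weierstrass form: let $P_L (A - \lambda B) P_R$ be in Weierstrass form, and rewrite
the equation as $P_L B P_R P_R^{-1} \dot{x} = P_L A P_R P_R^{-1} x + P_L f(t)$. Let $P_R^{-1} x = y$ and
$P_L f(t) = g(t)$. Now the problem has been decomposed into subproblems:

\[
\begin{bmatrix}
I_{n_1} & & & & & \\
 & \ddots & & & & \\
 & & I_{n_k} & & & \\
 & & & J_{m_1}(0) & & \\
 & & & & \ddots & \\
 & & & & & J_{m_r}(0)
\end{bmatrix} \dot{y}
=
\begin{bmatrix}
J_{n_1}(\lambda_1) & & & & & \\
 & \ddots & & & & \\
 & & J_{n_k}(\lambda_k) & & & \\
 & & & I_{m_1} & & \\
 & & & & \ddots & \\
 & & & & & I_{m_r}
\end{bmatrix} y + g.
\]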


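Each subproblem $\dot{\tilde{y}} = J_{n_i}(\lambda_i)\tilde{y} + \tilde{g}(t) \equiv J\tilde{y} + \tilde{g}(t)$ is a standard linear ODE
as above with solution

\[
\tilde{y}(t) = e^{Jt}\tilde{y}(0) + \int_0^t e^{J(t-\tau)} \tilde{g}(\tau)\,d\tau.
\]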


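The solution of $J_m(0)\dot{\tilde{y}} = \tilde{y} + \tilde{g}(t)$ is gotten by back substitution starting
from the last equation: write $J_m(0)\dot{\tilde{y}} = \tilde{y} + \tilde{g}(t)$ as

\[
\begin{bmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & 0 \end{bmatrix}
\begin{bmatrix} \dot{y}_1 \\ \vdots \\ \dot{y}_m \end{bmatrix}
=
\begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}
+
\begin{bmatrix} g_1 \\ \vdots \\ g_m \end{bmatrix}.
\]

The $m$th (last) equation says $0 = y_m + g_m$ or $y_m = -g_m$. The $i$th equation
says $\dot{y}_{i+1} = y_i + g_i$, so $y_i = \dot{y}_{i+1} - g_i$ and thus

\[
y_i = \sum_{k=i}^{m} -\frac{d^{k-i}}{dt^{k-i}} g_k(t).
\]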


Therefore the solution depends on derivatives of $g$, not an integral of $g$ as in
the usual ODE. Thus a continuous $g$ which is not differentiable can cause a
discontinuity in the solution; this is sometimes called an impulse response, and
occurs only if there are infinite eigenvalues. Furthermore, to have a continuous
solution $y$ must satisfy certain consistency conditions at $t = 0$:

\[
y_i(0) = \sum_{k=i}^{m} -\frac{d^{k-i}}{dt^{k-i}} g_k(0).
\]
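The back substitution for a nilpotent block can be sketched for polynomial right-hand sides, where differentiation is exact. This is a minimal illustration, not a numerical method for differential algebraic equations; the coefficient-list representation and the sample $g$ are assumptions made for the example.

```python
def deriv(p):
    """Derivative of a polynomial stored as coefficients [c0, c1, c2, ...]."""
    return [k * p[k] for k in range(1, len(p))] or [0]

def add(p, q):
    n = max(len(p), len(q))
    return [(p[k] if k < len(p) else 0) + (q[k] if k < len(q) else 0) for k in range(n)]

def neg(p):
    return [-c for c in p]

def solve_nilpotent_block(g):
    """Solve J_m(0) y' = y + g by back substitution: y_m = -g_m, y_i = y_{i+1}' - g_i."""
    m = len(g)
    y = [None] * m
    y[m - 1] = neg(g[m - 1])
    for i in range(m - 2, -1, -1):
        y[i] = add(deriv(y[i + 1]), neg(g[i]))
    return y

# Hypothetical data with m = 3: g_1 = 1 + t, g_2 = t^2, g_3 = t^3.
g = [[1, 1], [0, 0, 1], [0, 0, 0, 1]]
y = solve_nilpotent_block(g)
# The initial values are forced by g (the consistency conditions at t = 0).
y_at_zero = [p[0] for p in y]
```

Here $y_1 = -(g_1 + \dot{g}_2 + \ddot{g}_3) = -1 - 9t$, so no initial condition can be chosen freely: $y(0)$ is determined entirely by $g$ and its derivatives.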


Numerical methods, based on time-stepping, for solving such differential 
algebraic equations, or ODEs with algebraic constraints, are described in [40]. 


Generalized Schur Form for Regular Pencils 


Just as we cannot compute the Jordan form stably, we cannot compute its
generalization, the Weierstrass form, stably. Instead, we compute the generalized
Schur form.


THEOREM 4.11. Generalized Schur form. Let $A - \lambda B$ be regular. Then there
exist unitary $Q_L$ and $Q_R$ so that $Q_L A Q_R = T_A$ and $Q_L B Q_R = T_B$ are both
upper triangular. The eigenvalues of $A - \lambda B$ are then $T_{A,ii}/T_{B,ii}$, the ratios of
the diagonal entries of $T_A$ and $T_B$.


Proof. The proof is very much like that for the usual Schur form. Let $\lambda'$ be
an eigenvalue and $x$ a unit right eigenvector: $\|x\|_2 = 1$. Since $Ax - \lambda' Bx = 0$,
both $Ax$ and $Bx$ are multiples of the same unit vector $y$ (even if one of $Ax$
or $Bx$ is zero). Now let $X = [x, \tilde{X}]$ and $Y = [y, \tilde{Y}]$ be unitary matrices with
first columns $x$ and $y$, respectively. Then
$Y^* A X = \begin{bmatrix} a_{11} & a_{12} \\ 0 & \tilde{A}_{22} \end{bmatrix}$ and
$Y^* B X = \begin{bmatrix} b_{11} & b_{12} \\ 0 & \tilde{B}_{22} \end{bmatrix}$
by construction. Apply this process inductively to $\tilde{A}_{22} - \lambda \tilde{B}_{22}$.

If $A$ and $B$ are real, there is a generalized real Schur form too: real
orthogonal $Q_L$ and $Q_R$, where $Q_L A Q_R$ is quasi-upper triangular and $Q_L B Q_R$ is
upper triangular.

The QR algorithm and all its refinements generalize to compute the generalized
(real) Schur form; the result is called the QZ algorithm and is available in LAPACK
subroutine sgges. In Matlab one uses the command eig(A,B).
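Once a pencil is in generalized Schur form, reading off the eigenvalues is a matter of dividing diagonal entries, with a zero diagonal entry of $T_B$ signaling an infinite eigenvalue. The sketch below assumes the triangular pair is already given (it does not perform QZ), and the function name and sample matrices are hypothetical.

```python
from fractions import Fraction
import math

def pencil_eigenvalues(TA, TB):
    """Eigenvalues of the upper triangular pencil TA - lam*TB as diagonal ratios.

    A zero on the diagonal of TB gives an infinite eigenvalue; a simultaneous
    zero in TA and TB makes det(TA - lam*TB) vanish identically (singular pencil).
    """
    eigs = []
    for i in range(len(TA)):
        alpha, beta = TA[i][i], TB[i][i]
        if beta == 0:
            if alpha == 0:
                raise ValueError("alpha = beta = 0: the pencil is singular")
            eigs.append(math.inf)
        else:
            eigs.append(Fraction(alpha, beta))
    return eigs

TA = [[2, 5, 1], [0, 3, 7], [0, 0, 4]]
TB = [[1, 2, 0], [0, 0, 1], [0, 0, 2]]
eigs = pencil_eigenvalues(TA, TB)
```

Keeping the pairs $(\alpha_i, \beta_i)$ rather than the quotient is usually preferable in floating point, since the ratio can overflow when $\beta_i$ is tiny.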


Definite Pencils 


A simpler special case that often arises in practice is the pencil $A - \lambda B$, where
$A = A^T$, $B = B^T$, and $B$ is positive definite. Such pencils are called definite
pencils.


THEOREM 4.12. Let $A = A^T$, and let $B = B^T$ be positive definite. Then
there is a real nonsingular matrix $X$ so that $X^T A X = \mathrm{diag}(\alpha_1, \ldots, \alpha_n)$ and
$X^T B X = \mathrm{diag}(\beta_1, \ldots, \beta_n)$. In particular, all the eigenvalues $\alpha_i/\beta_i$ are real
and finite.


Proof. The proof that we give is actually the algorithm used to solve the
problem:


(1) Let $LL^T = B$ be the Cholesky decomposition.
(2) Let $H = L^{-1} A L^{-T}$; note that $H$ is symmetric.
(3) Let $H = Q \Lambda Q^T$, with $Q$ orthogonal, $\Lambda$ real and diagonal.


Then $X = L^{-T} Q$ satisfies $X^T A X = Q^T L^{-1} A L^{-T} Q = \Lambda$ and $X^T B X =
Q^T L^{-1} B L^{-T} Q = I$.
Note that the theorem is also true if $\alpha A + \beta B$ is positive definite for some
scalars $\alpha$ and $\beta$.
Software for this problem is available as LAPACK routine ssygv.
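The three steps of the proof can be sketched in a few lines. This is a minimal illustration on a hypothetical 3-by-3 pencil, not the LAPACK implementation: the symmetric eigenproblem in step (3) is solved here with a plain cyclic Jacobi iteration (a swapped-in technique chosen only because it is short), and $L^{-1}$ is formed explicitly for brevity.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def cholesky(B):
    """Lower triangular L with L L^T = B (B symmetric positive definite)."""
    n = len(B)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(B[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (B[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def lower_inverse(L):
    """Inverse of lower triangular L by forward substitution (fine for a tiny demo)."""
    n = len(L)
    Linv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            rhs = 1.0 if i == j else 0.0
            Linv[i][j] = (rhs - sum(L[i][k] * Linv[k][j] for k in range(i))) / L[i][i]
    return Linv

def jacobi_eig(H, sweeps=30):
    """Cyclic Jacobi: returns (eigenvalues, Q) with H approximately Q diag(eigs) Q^T."""
    n = len(H)
    A = [row[:] for row in H]
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                diff = A[q][q] - A[p][p]
                # Rotation angle that zeros A[p][q], restricted to |theta| <= pi/4.
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * A[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                R = [[float(i == j) for j in range(n)] for i in range(n)]
                R[p][p], R[p][q], R[q][p], R[q][q] = c, s, -s, c
                A = matmul(transpose(R), matmul(A, R))
                Q = matmul(Q, R)
    return [A[i][i] for i in range(n)], Q

def definite_pencil(A, B):
    L = cholesky(B)                                   # step (1)
    Linv = lower_inverse(L)
    H = matmul(Linv, matmul(A, transpose(Linv)))      # step (2): H = L^{-1} A L^{-T}
    lam, Q = jacobi_eig(H)                            # step (3): H = Q Lambda Q^T
    X = matmul(transpose(Linv), Q)                    # X = L^{-T} Q
    return lam, X

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
B = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
lam, X = definite_pencil(A, B)
XtAX = matmul(transpose(X), matmul(A, X))   # should be diag(lam)
XtBX = matmul(transpose(X), matmul(B, X))   # should be the identity
```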


EXAMPLE 4.15. Consider the pencil $K - \lambda M$ from Example 4.14. This is a
definite pencil since the stiffness matrix $K$ is symmetric and the mass matrix
$M$ is symmetric and positive definite. In fact, $K$ is tridiagonal and $M$ is
diagonal in this very simple example, so $M$'s Cholesky factor $L$ is also diagonal,
and $H = L^{-1} K L^{-T}$ is also symmetric and tridiagonal. In Chapter 5 we will
consider a variety of algorithms for the symmetric tridiagonal eigenproblem.
◊


4.5.2. Singular Matrix Pencils and the Kronecker Canonical Form


Now we consider singular pencils $A - \lambda B$. Recall that $A - \lambda B$ is singular if
either $A$ and $B$ are nonsquare or they are square and $\det(A - \lambda B) = 0$ for
all values of $\lambda$. The next example shows that care is needed in extending the
definition of eigenvalues to this case.

EXAMPLE 4.16. Let $A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$ and $B = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$. Then by making
arbitrarily small changes to get $A' = \begin{bmatrix} 1 & \epsilon_1 \\ \epsilon_2 & 0 \end{bmatrix}$ and $B' = \begin{bmatrix} 1 & \epsilon_3 \\ \epsilon_4 & 0 \end{bmatrix}$, the eigenvalues
become $\epsilon_1/\epsilon_3$ and $\epsilon_2/\epsilon_4$, which can be arbitrary complex numbers. So the
eigenvalues are infinitely sensitive. ◊
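The sensitivity can be checked in exact arithmetic. The sketch below assumes the perturbed pencil has the form $A' = \begin{bmatrix} 1 & \epsilon_1 \\ \epsilon_2 & 0 \end{bmatrix}$, $B' = \begin{bmatrix} 1 & \epsilon_3 \\ \epsilon_4 & 0 \end{bmatrix}$ and confirms that $\det(A' - \lambda B')$ vanishes at $\epsilon_1/\epsilon_3$ and $\epsilon_2/\epsilon_4$, so two perturbations of size about $10^{-12}$ give completely different eigenvalues. The particular perturbation values are hypothetical.

```python
from fractions import Fraction

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def perturbed_pencil_at(lam, e1, e2, e3, e4):
    """A' - lam*B' for A' = [[1, e1], [e2, 0]] and B' = [[1, e3], [e4, 0]]."""
    return [[1 - lam, e1 - lam * e3], [e2 - lam * e4, 0]]

eps = Fraction(1, 10**12)
perturbations = [
    (5 * eps, -7 * eps, eps, eps),         # eigenvalues 5 and -7
    (eps, 3 * eps, 1000 * eps, -2 * eps),  # eigenvalues 1/1000 and -3/2
]
results = []
for e1, e2, e3, e4 in perturbations:
    lams = (e1 / e3, e2 / e4)
    dets = [det2(perturbed_pencil_at(lam, e1, e2, e3, e4)) for lam in lams]
    results.append((lams, dets))
```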


Despite this extreme sensitivity, singular pencils are used in modeling cer- 
tain physical systems, as we describe below. 

We continue by showing how to generalize the Jordan and Weierstrass forms 
to singular pencils. In addition to Jordan and “infinite Jordan” blocks, we get 
two new “singular blocks” in the canonical form. 


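THEOREM 4.13. Kronecker canonical form. Let $A$ and $B$ be arbitrary rectangular
$m$-by-$n$ matrices. Then there are square nonsingular matrices $P_L$ and
$P_R$ so that $P_L A P_R - \lambda P_L B P_R$ is block diagonal with four kinds of blocks:

\[
J_m(\lambda') - \lambda I = \begin{bmatrix} \lambda' - \lambda & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & \lambda' - \lambda \end{bmatrix}, \quad \text{$m$-by-$m$ Jordan block;}
\]

\[
N_m = \begin{bmatrix} 1 & \lambda & & \\ & \ddots & \ddots & \\ & & \ddots & \lambda \\ & & & 1 \end{bmatrix}, \quad \text{$m$-by-$m$ Jordan block for $\lambda = \infty$;}
\]

\[
L_m = \begin{bmatrix} 1 & \lambda & & \\ & \ddots & \ddots & \\ & & 1 & \lambda \end{bmatrix}, \quad \text{$m$-by-$(m+1)$ right singular block;}
\]

\[
L_m^T = \begin{bmatrix} 1 & & \\ \lambda & \ddots & \\ & \ddots & 1 \\ & & \lambda \end{bmatrix}, \quad \text{$(m+1)$-by-$m$ left singular block.}
\]

We call $L_m$ a right singular block because it has a right null vector
$[\lambda^m, -\lambda^{m-1}, \ldots, (-1)^m]^T$ for all $\lambda$. $L_m^T$ has an analogous left null vector.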
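The null vector claim is easy to confirm for a few values of $\lambda$. This minimal sketch assumes the block layout with ones on the diagonal and $\lambda$ on the superdiagonal; the function names are hypothetical.

```python
from fractions import Fraction

def right_singular_block(m, lam):
    """m-by-(m+1) block L_m at a given lam: ones on the diagonal, lam on the superdiagonal."""
    L = [[Fraction(0)] * (m + 1) for _ in range(m)]
    for i in range(m):
        L[i][i] = Fraction(1)
        L[i][i + 1] = Fraction(lam)
    return L

def null_vector(m, lam):
    """The vector [lam^m, -lam^(m-1), ..., (-1)^m]^T."""
    lam = Fraction(lam)
    return [(-1) ** i * lam ** (m - i) for i in range(m + 1)]

m = 4
residuals = []
for lam in (Fraction(3), Fraction(-1, 2), Fraction(0)):
    L, v = right_singular_block(m, lam), null_vector(m, lam)
    residuals.append([sum(L[i][j] * v[j] for j in range(m + 1)) for i in range(m)])
```

Because the last entry of the vector is $\pm 1$, it is never zero, so $L_m$ has a nontrivial right null space for every $\lambda$, which is exactly why the pencil is singular.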

For a proof, see [108]. 

Just as Schur form generalized to regular matrix pencils in the last section, 
it can be generalized to arbitrary singular pencils as well. For the canonical 
form, perturbation theory and software, see [27, 78, 244]. 

Singular pencils are used to model systems arising in systems and control.
We give two examples.


Application of Kronecker Form to Differential Equations


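Suppose that we want to solve $B\dot{x} = Ax + f(t)$, where $A - \lambda B$ is a singular
pencil. Write $P_L B P_R P_R^{-1} \dot{x} = P_L A P_R P_R^{-1} x + P_L f(t)$ to decompose the problem
into independent blocks. There are four kinds, one for each kind in the
Kronecker form. We have already dealt with $J_m(\lambda') - \lambda I$ and $N_m$ blocks when
we considered regular pencils and Weierstrass form, so we have to consider
only $L_m$ and $L_m^T$ blocks. From the $L_m$ blocks we get

\[
\begin{bmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & 0 & 1 \end{bmatrix}
\begin{bmatrix} \dot{y}_1 \\ \vdots \\ \dot{y}_{m+1} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & & \\ & \ddots & \ddots & \\ & & 1 & 0 \end{bmatrix}
\begin{bmatrix} y_1 \\ \vdots \\ y_{m+1} \end{bmatrix}
+
\begin{bmatrix} g_1 \\ \vdots \\ g_m \end{bmatrix}
\]

or

\[
\begin{aligned}
\dot{y}_2 &= y_1 + g_1 & \text{or} \quad y_2(t) &= y_2(0) + \int_0^t (y_1(\tau) + g_1(\tau))\,d\tau, \\
\dot{y}_3 &= y_2 + g_2 & \text{or} \quad y_3(t) &= y_3(0) + \int_0^t (y_2(\tau) + g_2(\tau))\,d\tau, \\
&\;\;\vdots \\
\dot{y}_{m+1} &= y_m + g_m & \text{or} \quad y_{m+1}(t) &= y_{m+1}(0) + \int_0^t (y_m(\tau) + g_m(\tau))\,d\tau.
\end{aligned}
\]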


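This means that we can choose $y_1$ as an arbitrary integrable function and use
the above recurrence relations to get a solution. This is because we have one
more unknown than equation, so the ODE is underdetermined. From the
$L_m^T$ blocks we get

\[
\begin{bmatrix} 0 & & \\ 1 & \ddots & \\ & \ddots & 0 \\ & & 1 \end{bmatrix}
\begin{bmatrix} \dot{y}_1 \\ \vdots \\ \dot{y}_m \end{bmatrix}
=
\begin{bmatrix} 1 & & \\ 0 & \ddots & \\ & \ddots & 1 \\ & & 0 \end{bmatrix}
\begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}
+
\begin{bmatrix} g_1 \\ \vdots \\ g_{m+1} \end{bmatrix}
\]

or

\[
\begin{aligned}
0 &= y_1 + g_1, \\
\dot{y}_1 &= y_2 + g_2, \\
&\;\;\vdots \\
\dot{y}_{m-1} &= y_m + g_m, \\
\dot{y}_m &= g_{m+1}.
\end{aligned}
\]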


Starting with the first equation, we solve to get

\[
\begin{aligned}
y_1 &= -g_1, \\
y_2 &= -g_2 - \dot{g}_1, \\
&\;\;\vdots \\
y_m &= -g_m - \dot{g}_{m-1} - \cdots - \frac{d^{m-1}}{dt^{m-1}} g_1
\end{aligned}
\]

and the consistency condition $g_{m+1} = -\dot{g}_m - \cdots - \frac{d^m}{dt^m} g_1$. So unless the $g_i$
satisfy this equation, there is no solution. Here we have one more equation
than variable, and the subproblem is overdetermined.

Application of Kronecker Form to Systems and Control Theory


The controllable subspace of $\dot{x}(t) = Ax(t) + Bu(t)$ is the space in which the
system state $x(t)$ can be “controlled” by choosing the control input $u(t)$ starting
at $x(0) = 0$. This equation is used to model (feedback) control systems, where
the $u(t)$ is chosen by the control system engineer to make $x(t)$ have certain
desirable properties, such as boundedness. From

\[
x(t) = \int_0^t e^{A(t-\tau)} B u(\tau)\,d\tau
= \int_0^t \sum_{i=0}^{\infty} \frac{(t-\tau)^i}{i!} A^i B u(\tau)\,d\tau
= \sum_{i=0}^{\infty} A^i B \int_0^t \frac{(t-\tau)^i}{i!} u(\tau)\,d\tau
\]

one can prove the controllable space is $\mathrm{span}\{[B, AB, A^2B, \ldots, A^{n-1}B]\}$; any
components of $x(t)$ outside this space cannot be controlled by varying $u(t)$.
To compute this space in practice, in order to determine whether the physical
system being modeled can in fact be controlled by input $u(t)$, one applies a QR-like
algorithm to the singular pencil $[B, A - \lambda I]$. For details, see [77, 244, 245].
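For tiny examples with exact data, the dimension of the controllable subspace can be read off as the rank of $[B, AB, \ldots, A^{n-1}B]$. The sketch below does this in rational arithmetic; it illustrates the definition only and is not the QR-like pencil algorithm referred to above, which is what one would use in floating point. The function names and the two sample systems are assumptions.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def rank(M):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    M = [[Fraction(v) for v in row] for row in M]
    rows, cols, r = len(M), len(M[0]), 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == rows:
            break
    return r

def controllability_matrix(A, B):
    """The n-by-(n*k) matrix [B, AB, ..., A^(n-1) B]."""
    n = len(A)
    blocks, P = [], B
    for _ in range(n):
        blocks.append(P)
        P = matmul(A, P)
    return [sum((blk[i] for blk in blocks), []) for i in range(n)]

# A chain of integrators driven at the end: every state can be reached.
A1 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
B1 = [[0], [0], [1]]
# A diagonal system whose third mode receives no input: it cannot be controlled.
A2 = [[1, 0, 0], [0, 2, 0], [0, 0, 3]]
B2 = [[1], [1], [0]]
dim1 = rank(controllability_matrix(A1, B1))
dim2 = rank(controllability_matrix(A2, B2))
```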


4.5.3. Nonlinear Eigenvalue Problems 


Finally, we consider the nonlinear eigenvalue problem or matrix polynomial

\[
\sum_{i=0}^{d} \lambda^i A_i = \lambda^d A_d + \lambda^{d-1} A_{d-1} + \cdots + \lambda A_1 + A_0. \qquad (4.7)
\]

Suppose for simplicity that the $A_i$ are $n$-by-$n$ matrices and $A_d$ is nonsingular.


DEFINITION 4.10. The characteristic polynomial of the matrix polynomial (4.7)
is $p(\lambda) = \det\bigl(\sum_{i=0}^{d} \lambda^i A_i\bigr)$. The roots of $p(\lambda) = 0$ are defined to be the eigenvalues.
One can confirm that $p(\lambda)$ has degree $d \cdot n$, so there are $d \cdot n$ eigenvalues.
Suppose that $\gamma$ is an eigenvalue. A nonzero vector $x$ satisfying $\sum_{i=0}^{d} \gamma^i A_i x = 0$
is a right eigenvector for $\gamma$. A left eigenvector $y$ is defined analogously by
$\sum_{i=0}^{d} \gamma^i y^* A_i = 0$.


EXAMPLE 4.17. Consider Example 4.1 once again. The ODE arising there
in equation (4.3) is $M\ddot{x}(t) + B\dot{x}(t) + Kx(t) = 0$. If we seek solutions of the
form $x(t) = e^{\lambda_i t} x_i(0)$, we get $e^{\lambda_i t}(\lambda_i^2 M x_i(0) + \lambda_i B x_i(0) + K x_i(0)) = 0$, or
$\lambda_i^2 M x_i(0) + \lambda_i B x_i(0) + K x_i(0) = 0$. Thus $\lambda_i$ is an eigenvalue and $x_i(0)$ is an
eigenvector of the matrix polynomial $\lambda^2 M + \lambda B + K$. ◊


Since we are assuming that $A_d$ is nonsingular, we can multiply through
by $A_d^{-1}$ to get the equivalent problem $\lambda^d I + A_d^{-1} A_{d-1} \lambda^{d-1} + \cdots + A_d^{-1} A_0$.
Therefore, to keep the notation simple, we will assume $A_d = I$ (see section 4.6
for the general case). In the very simplest case where each $A_i$ is 1-by-1, i.e., a
scalar, the original matrix polynomial is equal to the characteristic polynomial.

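We can turn the problem of finding the eigenvalues of a matrix polynomial
into a standard eigenvalue problem by using a trick analogous to the one used
to change a high-order ODE into a first-order ODE. Consider first the simplest
case $n = 1$, where each $A_i$ is a scalar. Suppose that $\gamma$ is a root. Then the
vector $x' = [\gamma^{d-1}, \gamma^{d-2}, \ldots, \gamma, 1]^T$ satisfies

\[
Cx' \equiv
\begin{bmatrix}
-A_{d-1} & -A_{d-2} & \cdots & \cdots & -A_0 \\
1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & 0
\end{bmatrix} x'
=
\begin{bmatrix} -\sum_{i=0}^{d-1} \gamma^i A_i \\ \gamma^{d-1} \\ \vdots \\ \gamma^2 \\ \gamma \end{bmatrix}
=
\begin{bmatrix} \gamma^d \\ \gamma^{d-1} \\ \vdots \\ \gamma^2 \\ \gamma \end{bmatrix}
= \gamma x'.
\]

Thus $x'$ is an eigenvector and $\gamma$ is an eigenvalue of the matrix $C$, which is
called the companion matrix of the polynomial (4.7).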
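A quick check of the scalar case, as a minimal sketch: for the monic cubic $(z-1)(z-2)(z-3) = z^3 - 6z^2 + 11z - 6$ (an example chosen here, not taken from the text), each root $\gamma$ should give $Cx' = \gamma x'$ with $x' = [\gamma^2, \gamma, 1]^T$.

```python
def companion(coeffs):
    """Companion matrix of z^d + a_{d-1} z^{d-1} + ... + a_0, coeffs = [a_{d-1}, ..., a_0]."""
    d = len(coeffs)
    C = [[0] * d for _ in range(d)]
    C[0] = [-a for a in coeffs]        # top row holds the negated coefficients
    for i in range(1, d):
        C[i][i - 1] = 1                # ones on the subdiagonal
    return C

C = companion([-6, 11, -6])            # z^3 - 6 z^2 + 11 z - 6
d = len(C)
checks = []
for gamma in (1, 2, 3):
    x = [gamma ** (d - 1 - k) for k in range(d)]              # [gamma^2, gamma, 1]
    Cx = [sum(C[i][j] * x[j] for j in range(d)) for i in range(d)]
    checks.append(Cx == [gamma * v for v in x])
```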

(The Matlab routine roots for finding roots of a polynomial applies the 
Hessenberg QR iteration of section 4.4.8 to the companion matrix C, since this 
is currently one of the most reliable, if expensive, methods known [98, 115, 239]. 
Cheaper alternatives are under development.) 

The same idea works when the $A_i$ are matrices. $C$ becomes an $(n \cdot d)$-by-$(n \cdot d)$
block companion matrix, where the 1's and 0's below the top row become
$n$-by-$n$ identity and zero matrices, respectively. Also, $x'$ becomes

\[
x' = \begin{bmatrix} \gamma^{d-1} x \\ \gamma^{d-2} x \\ \vdots \\ \gamma x \\ x \end{bmatrix},
\]

where $x$ is a right eigenvector of the matrix polynomial. It again turns out
that $Cx' = \gamma x'$.
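The block case can be checked the same way on a small quadratic $\lambda^2 M + \lambda B + K$. In this sketch the matrices are hypothetical, with a diagonal $M$ so that $M^{-1}$ is trivial, and were chosen so that the eigenpairs are known by inspection: $x = [1, 1]^T$ with $\gamma = -1$ or $-3$, and $x = [1, -1]^T$ with $\gamma = -1$.

```python
from fractions import Fraction

def block_companion(M_diag, B, K):
    """C = [[-M^{-1}B, -M^{-1}K], [I, 0]] for a diagonal mass matrix given by its diagonal."""
    n = len(M_diag)
    C = [[Fraction(0)] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            C[i][j] = -Fraction(B[i][j]) / M_diag[i]
            C[i][n + j] = -Fraction(K[i][j]) / M_diag[i]
        C[n + i][i] = Fraction(1)
    return C

M_diag = [2, 2]
B = [[6, 2], [2, 6]]
K = [[4, 2], [2, 4]]
C = block_companion(M_diag, B, K)

checks = []
for gamma, x in [(-1, [1, 1]), (-3, [1, 1]), (-1, [1, -1])]:
    xp = [gamma * v for v in x] + x                    # x' = [gamma*x; x]
    Cxp = [sum(C[i][j] * xp[j] for j in range(4)) for i in range(4)]
    checks.append(Cxp == [gamma * v for v in xp])
```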
EXAMPLE 4.18. Returning once again to $\lambda^2 M + \lambda B + K$, we first convert it
to $\lambda^2 + \lambda M^{-1} B + M^{-1} K$ and then to the companion matrix

\[
C = \begin{bmatrix} -M^{-1}B & -M^{-1}K \\ I & 0 \end{bmatrix}.
\]

This is the same as the matrix $A$ in equation 4.4 of Example 4.1. ◊

Finally, Question 4.16 shows how to use matrix polynomials to solve a
problem in computational geometry.


4.6. Summary 


The following list summarizes all the canonical forms, algorithms, their costs, 
and applications to ODEs described in this chapter. It also includes pointers 
to algorithms exploiting symmetry, although these are discussed in more detail 
in the next chapter. Algorithms for sparse matrices are discussed in Chapter 7. 


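• $A - \lambda I$
– Jordan form: For some nonsingular $S$,
\[
A - \lambda I = S \cdot \mathrm{diag}\left( \ldots, \begin{bmatrix} \lambda_i - \lambda & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i - \lambda \end{bmatrix}^{n_i \times n_i}, \ldots \right) \cdot S^{-1}.
\]


– Schur form: For some unitary $Q$, $A - \lambda I = Q(T - \lambda I)Q^*$, where $T$
is triangular.


– Real Schur form of real $A$: For some real orthogonal $Q$, $A - \lambda I =
Q(T - \lambda I)Q^T$, where $T$ is real quasi-triangular.


– Application to ODEs: Provides solution of $\dot{x}(t) = Ax(t) + f(t)$.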


— Algorithm: Do Hessenberg reduction (Algorithm 4.6), followed by 
QR iteration to get Schur form (Algorithm 4.5, implemented as 
described in section 4.4.8). Eigenvectors can be computed from the 
Schur form (as described in section 4.2.1). 


– Cost: This costs $10n^3$ flops if eigenvalues only are desired, $25n^3$ if
$T$ and $Q$ are also desired, and a little over $27n^3$ if eigenvectors are
also desired. Since not all parts of the algorithm can take advantage
of the Level 3 BLAS, the cost is actually higher than a comparison
with the $2n^3$ cost of matrix multiply would indicate: instead of taking
$(10n^3)/(2n^3) = 5$ times longer to compute eigenvalues than to
multiply matrices, it takes 23 times longer for n = 100 and 19 times
longer for n = 1000 on an IBM RS6000/590 [10, page 62]. Instead
of taking $(27n^3)/(2n^3) = 13.5$ times longer to compute eigenvalues
and eigenvectors, it takes 41 times longer for n = 100 and 60 times
longer for n = 1000 on the same machine. Thus computing eigenvalues
of nonsymmetric matrices is expensive. (The symmetric case
is much cheaper; see Chapter 5.)




— LAPACK: sgees for Schur form or sgeev for eigenvalues and eigen- 
vectors; sgeesx or sgeevx for error bounds too. 


— Matlab: schur for Schur form or eig for eigenvalues and eigenvec- 
tors. 


– Exploiting symmetry: When $A = A^*$, better algorithms are discussed
in Chapter 5, especially section 5.3.


• Regular $A - \lambda B$ ($\det(A - \lambda B) \not\equiv 0$)
– Weierstrass form: For some nonsingular $P_L$ and $P_R$,
\[
A - \lambda B = P_L \cdot \mathrm{diag}\left( \text{Jordan}, \begin{bmatrix} 1 & \lambda & & \\ & \ddots & \ddots & \\ & & \ddots & \lambda \\ & & & 1 \end{bmatrix}^{n_i \times n_i} \right) \cdot P_R.
\]


– Generalized Schur form: For some unitary $Q_L$ and $Q_R$, $A - \lambda B =
Q_L (T_A - \lambda T_B) Q_R^*$, where $T_A$ and $T_B$ are triangular.


– Generalized real Schur form of real $A$ and $B$: For some real orthogonal
$Q_L$ and $Q_R$, $A - \lambda B = Q_L (T_A - \lambda T_B) Q_R^T$, where $T_A$ is real
quasi-triangular and $T_B$ is real triangular.


– Application to ODEs: Provides solution of $B\dot{x}(t) = Ax(t) + f(t)$,
where the solution is uniquely determined but may depend nonsmoothly
on the data (impulse response).


– Algorithm: Hessenberg/triangular reduction followed by QZ iteration
(QR applied implicitly to $AB^{-1}$).


– Cost: Computing $T_A$ and $T_B$ costs $30n^3$. Computing $Q_L$ and $Q_R$ in
addition costs $66n^3$. Computing eigenvectors as well costs a little
less than $69n^3$ in total. As before, Level 3 BLAS cannot be used in
all parts of the algorithm.


— LAPACK: sgges for Schur form or sggev for eigenvalues; sggesx 
or sggevx for error bounds too. 


— Matlab: eig for eigenvalues and eigenvectors. 


– Exploiting symmetry: When $A = A^*$, $B = B^*$, and $B$ is positive
definite, one can convert the problem to finding the eigenvalues
of a single symmetric matrix using Theorem 4.12. This is done in
LAPACK routines ssygv, sspgv (for symmetric matrices in “packed
storage”), and ssbgv (for symmetric band matrices).


• Singular $A - \lambda B$
– Kronecker form: For some nonsingular $P_L$ and $P_R$,
\[
A - \lambda B = P_L \cdot \mathrm{diag}\left( \text{Weierstrass},
\begin{bmatrix} 1 & \lambda & & \\ & \ddots & \ddots & \\ & & 1 & \lambda \end{bmatrix}^{n_i \times (n_i + 1)},
\begin{bmatrix} 1 & & \\ \lambda & \ddots & \\ & \ddots & 1 \\ & & \lambda \end{bmatrix}^{(m_i + 1) \times m_i}
\right) \cdot P_R.
\]


– Generalized upper triangular form: For some unitary $Q_L$ and $Q_R$,
$A - \lambda B = Q_L (T_A - \lambda T_B) Q_R^*$, where $T_A$ and $T_B$ are in generalized
upper triangular form, with diagonal blocks corresponding to different
parts of the Kronecker form. See [78, 244] for details of the
form and algorithms.


– Cost: The most general and reliable version of the algorithm can
cost as much as $O(n^4)$, depending on the details of the Kronecker
structure; this is much more than for regular $A - \lambda B$. There is also
a slightly less reliable $O(n^3)$ algorithm [27].


– Application to ODEs: Provides solution of $B\dot{x}(t) = Ax(t) + f(t)$,
where the solution may be overdetermined or underdetermined.


— Software: NETLIB/linalg/guptri. 
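• Matrix polynomials $\sum_{i=0}^{d} \lambda^i A_i$ [116]


– If $A_d = I$ (or $A_d$ is square and well-conditioned enough to replace
each $A_i$ by $A_d^{-1} A_i$), then linearize to get the standard problem
\[
\begin{bmatrix}
-A_{d-1} & -A_{d-2} & \cdots & \cdots & -A_0 \\
I & 0 & \cdots & \cdots & 0 \\
0 & I & 0 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & I & 0
\end{bmatrix} - \lambda I.
\]


– If $A_d$ is ill-conditioned or singular, linearize to get the pencil
\[
\begin{bmatrix}
-A_{d-1} & -A_{d-2} & \cdots & \cdots & -A_0 \\
I & 0 & \cdots & \cdots & 0 \\
0 & I & 0 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & I & 0
\end{bmatrix}
- \lambda
\begin{bmatrix}
A_d & & & \\
 & I & & \\
 & & \ddots & \\
 & & & I
\end{bmatrix}.
\]


4.7. References and Other Topics for Chapter 4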


For a general discussion of properties of eigenvalues and eigenvectors, see [137]. 
For more details about perturbation theory of eigenvalues and eigenvectors, see 
[159, 235, 51], and chapter 4 of [10]. For a proof of Theorem 4.7, see [68]. For 
a discussion of Weierstrass and Kronecker canonical forms, see [108, 116]. 
For their application to systems and control theory, see [244, 245, 77]. For 
applications to computational geometry, graphics, and mechanical CAD, see 
[179, 180, 163]. For a discussion of parallel algorithms for the nonsymmetric
eigenproblem, see [75]. 


4.8. Questions for Chapter 4 


QUESTION 4.1. (Easy) Let $A$ be defined as in equation (4.1). Show that
$\det(A) = \prod_{i=1}^{b} \det(A_{ii})$ and then that $\det(A - \lambda I) = \prod_{i=1}^{b} \det(A_{ii} - \lambda I)$.
Conclude that the set of eigenvalues of $A$ is the union of the sets of eigenvalues
of $A_{11}$ through $A_{bb}$.


QUESTION 4.2. (Medium; Z. Bai) Suppose that A is normal; i.e., AA* = A*A. 
Show that if A is also triangular, it must be diagonal. Use this to show that 
an n-by-n matrix is normal if and only if it has n orthonormal eigenvectors. 
Hint: Show that A is normal if and only if its Schur form is normal. 


QUESTION 4.3. (Easy; Z. Bai) Let $\lambda$ and $\mu$ be distinct eigenvalues of $A$, let $x$
be a right eigenvector for $\lambda$, and let $y$ be a left eigenvector for $\mu$. Show that
$x$ and $y$ are orthogonal.


QUESTION 4.4. (Medium) Suppose $A$ has distinct eigenvalues. Let $f(z) =
\sum_{i=-\infty}^{\infty} a_i z^i$ be a function which is defined at the eigenvalues of $A$. Let $Q^* A Q =
T$ be the Schur form of $A$ (so $Q$ is unitary and $T$ upper triangular).


1. Show that f(A) = Qf(T)Q*. Thus to compute f(A) it suffices to be 
able to compute f(T). In the rest of the problem you will derive a 
simple recurrence formula for f(T). 


2. Show that $(f(T))_{ii} = f(T_{ii})$ so that the diagonal of $f(T)$ can be computed
from the diagonal of $T$.


3. Show that Tf(T) = f(T)T. 


4. From the last result, show that the $i$th superdiagonal of $f(T)$ can be
computed from the $(i-1)$st and earlier superdiagonals. Thus, starting
at the diagonal of $f(T)$, we can compute the first superdiagonal, second
superdiagonal, and so on.




QUESTION 4.5. (Easy) Let $A$ be a square matrix. Apply either Question 4.4
to the Schur form of $A$ or equation (4.6) to the Jordan form of $A$ to conclude
that the eigenvalues of $f(A)$ are $f(\lambda_i)$, where the $\lambda_i$ are the eigenvalues of $A$.
This result is called the spectral mapping theorem.

This question is used in the proof of Theorem 6.5 and section 6.5.6. 


QUESTION 4.6. (Medium) In this problem we will show how to solve the 
Sylvester or Lyapunov equation AX — XB = C, where X and C are m-by-n, 
A is m-by-m, and B is n-by-n. This is a system of mn linear equations for the 
entries of X. 


1. Given the Schur decompositions of $A$ and $B$, show how $AX - XB = C$
can be transformed into a similar system $A'Y - YB' = C'$, where $A'$ and
$B'$ are upper triangular.


2. Show how to solve for the entries of Y one at a time by a process analo- 
gous to back substitution. What condition on the eigenvalues of A and 
B guarantees that the system of equations is nonsingular? 


3. Show how to transform Y to get the solution X. 


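QUESTION 4.7. (Medium) Suppose that $T = \begin{bmatrix} A & C \\ 0 & B \end{bmatrix}$ is in Schur form. We
want to find a matrix $S$ so that $S^{-1} T S = \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix}$. It turns out we can choose
$S$ of the form $\begin{bmatrix} I & R \\ 0 & I \end{bmatrix}$. Show how to solve for $R$.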


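QUESTION 4.8. (Medium; Z. Bai) Let $A$ be $m$-by-$n$ and $B$ be $n$-by-$m$. Show
that the matrices

\[
\begin{bmatrix} AB & 0 \\ B & 0 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 0 & 0 \\ B & BA \end{bmatrix}
\]

are similar. Conclude that the nonzero eigenvalues of $AB$ are the same as
those of $BA$.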


QUESTION 4.9. (Medium; Z. Bai) Let $A$ be $n$-by-$n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$.
Show that

\[
\sum_{i=1}^{n} |\lambda_i|^2 = \min_{\det(S) \neq 0} \|S^{-1} A S\|_F^2.
\]


QUESTION 4.10. (Medium; Z. Bai) Let $A$ be an $n$-by-$n$ matrix with eigenvalues
$\lambda_1, \ldots, \lambda_n$.


1. Show that $A$ can be written $A = H + S$, where $H = H^*$ is Hermitian
and $S = -S^*$ is skew-Hermitian. Give explicit formulas for $H$ and $S$ in
terms of $A$.


2. Show that $\sum_{i=1}^{n} |\Re \lambda_i|^2 \le \|H\|_F^2$.

3. Show that $\sum_{i=1}^{n} |\Im \lambda_i|^2 \le \|S\|_F^2$.

4. Show that $A$ is normal ($AA^* = A^*A$) if and only if $\sum_{i=1}^{n} |\lambda_i|^2 = \|A\|_F^2$.

QUESTION 4.11. (Easy) Let $\lambda$ be a simple eigenvalue, and let $x$ and $y$ be right
and left eigenvectors. We define the spectral projection $P$ corresponding to $\lambda$
as $P = xy^*/(y^*x)$. Prove that $P$ has the following properties.


1. $P$ is uniquely defined, even though we could use any nonzero scalar multiples
of $x$ and $y$ in its definition.


2. $P^2 = P$. (Any matrix satisfying $P^2 = P$ is called a projection matrix.)


3. $AP = PA = \lambda P$. (These properties motivate the name spectral projection,
since $P$ “contains” the left and right invariant subspaces of $\lambda$.)


4. $\|P\|_2$ is the condition number of $\lambda$.


QUESTION 4.12. (Easy; Z. Bai) Let $A = \begin{bmatrix} a & c \\ 0 & b \end{bmatrix}$. Show that the condition
numbers of the eigenvalues of $A$ are both equal to $\bigl(1 + (\frac{c}{a-b})^2\bigr)^{1/2}$. Thus, the
condition number is large if the difference $a - b$ between the eigenvalues is small
compared to $c$, the offdiagonal part of the matrix.


QUESTION 4.13. (Medium; Z. Bai) Let $A$ be a matrix, $x$ be a unit vector
($\|x\|_2 = 1$), $\mu$ be a scalar, and $r = Ax - \mu x$. Show that there is a matrix $E$
with $\|E\|_F = \|r\|_2$ such that $A + E$ has eigenvalue $\mu$ and eigenvector $x$.


QUESTION 4.14. (Medium; Programming) In this question we
will use a Matlab program to plot eigenvalues of a perturbed matrix and their
condition numbers. (It is available at HOMEPAGE/Matlab/eigscat.m.) The
input is

a = input matrix, 

err = size of perturbation, 

m = number of perturbed matrices to compute. 


The output consists of three plots in which each symbol is the location of an 
eigenvalue of a perturbed matrix: 


'o' marks the location of each unperturbed eigenvalue.
'x' marks the location of each perturbed eigenvalue, where a real
perturbation matrix of norm err is added to a.
'.' marks the location of each perturbed eigenvalue, where a complex
perturbation matrix of norm err is added to a.

A table of the eigenvalues of A and their condition numbers is also printed. 


Here are some interesting examples to try (for as large an m as you want
to wait; the larger the m the better, and m equal to a few hundred is good).

(1) a = randn(5) (if a does not have complex eigenvalues, try again)
    err=1e-5, 1e-4, 1e-3, 1e-2, .1, .2


(2) a = diag(ones(4,1),1); err=1e-12, 1e-10, 1e-8


(3) a=[[1 1e6 0 0];
       [0 2 1e-3 0];
       [0 0 3 10];
       [0 0 -1 4]]
    err=1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3


(4) [q,r]=qr(randn(4,4)); a=q*diag(ones(3,1),1)*q'
    err=1e-16, 1e-14, 1e-12, 1e-10, 1e-8


(5) a = [[1 1e3 1e6]; [0 1 1e3]; [0 0 1]],
    err=1e-7, 1e-6, 5e-6, 8e-6, 1e-5, 1.5e-5, 2e-5


(6) a = [[1 0 0 0 0 0];
         [0 2 1 0 0 0];
         [0 0 2 0 0 0];
         [0 0 0 3 1e2 1e4];
         [0 0 0 0 3 1e2];
         [0 0 0 0 0 3]]
    err= 1e-10, 1e-8, 1e-6, 1e-4, 1e-3


Your assignment is to try these examples and compare the regions occupied 
by the eigenvalues (the so-called pseudospectrum) with the bounds described 
in section 4.3. What is the difference between real perturbations and complex 
perturbations? What happens to the regions occupied by the eigenvalues as 
the perturbation err goes to zero? What is the limiting size of the regions as err
goes to zero (i.e., how many digits of the computed eigenvalues are correct)? 


QUESTION 4.15. (Medium; Programming) In this question we use a Matlab 
program to plot the diagonal entries of a matrix undergoing unshifted QR 
iteration. The values of each diagonal are plotted after each QR iteration, each 
diagonal corresponding to one of the plotted curves. (The program is available 
at HOMEPAGE/Matlab/qrplt.m and also shown below.) The inputs are 

a = input matrix, 

m = number of QR iterations, 
and the output is a plot of the diagonals. 

Examples to try this code on are as follows (choose m large enough so that 

the curves either converge or go into cycles): 


a = randn(6);
b = randn(6); a = b*diag([1,2,3,4,5,6])*inv(b);
a = [[1 10];[-1 1]]; m = 300
a = diag((1.5*ones(1,5)).^(0:4)) + .01*(diag(ones(4,1),1)+diag(ones(4,1),-1)); m=30

What happens if there are complex eigenvalues? 

In what order do the eigenvalues appear in the matrix after many itera- 
tions? 

Perform the following experiment: Suppose that a is n-by-n and symmetric. 
In Matlab, let perm=(n:-1:1). This produces a list of the integers from n down 
to 1. Run the iteration for m iterations. Let a=a(perm,perm); we call this 
“flipping” a, because it reverses the order of the rows and columns of a. Run 
the iteration again for m iterations, and again form a=a(perm,perm). How 
does this value of a compare with the original value of a? You should not let 
m be too large (try m = 5) or else roundoff will obscure the relationship you 
should see. (See also Corollary 5.4 and Question 5.25.) 

Change the code to compute the error in each diagonal from its final value 
(do this just for matrices with all real eigenvalues). Plot the log of this error 
versus the iteration number. What do you get asymptotically? 


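hold off
e=diag(a);                     % record the initial diagonal
for i=1:m,
  [q,r]=qr(a);                 % one step of unshifted QR iteration
  dd=diag(sign(diag(r)));      % flip signs so that diag(r) is nonnegative
  r=dd*r; q=q*dd; a=r*q;
  e=[e,diag(a)];               % append the current diagonal
end
clf                            % clg in older versions of Matlab
plot(e','w'),grid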


QUESTION 4.16. (Hard; Programming) This problem describes an application 
of the nonlinear eigenproblem to computer graphics, computational geometry, 
and mechanical CAD; see also [179, 180, 163]. 

Let $F = [f_{ij}(x_1, x_2, x_3)]$ be a matrix whose entries are polynomials in the
three variables $x_i$. Then $\det(F) = 0$ will (generally) define a two-dimensional
surface $S$ in 3-space. Let $x_1 = g_1(t)$, $x_2 = g_2(t)$, and $x_3 = g_3(t)$ define a (one-dimensional)
curve $C$ parameterized by $t$, where the $g_i$ are also polynomials.
We want to find the intersection $S \cap C$. Show how to express this as an
eigenvalue problem (which can then be solved numerically). More generally,
explain how to find the intersection of a surface $\det(F(x_1, \ldots, x_n)) = 0$ and
curve $\{x_i = g_i(t),\ 1 \le i \le n\}$. At most how many discrete solutions can there
be, as a function of $n$, the dimension $d$ of $F$, and the maximum of the degrees
of the polynomials $f_{ij}$ and $g_k$?

Write a Matlab program to solve this problem, for n = 3 variables, by
converting it to an eigenvalue problem. It should take as input a compact
description of the entries of each $f_{ij}(x_k)$ and $g_i(t)$ and produce a list of the
intersection points. For instance, it could take the following inputs:

Array NumTerms(1:d,1:d), where NumTerms(i,j) is the number of terms
in the polynomial $f_{ij}(x_1, x_2, x_3)$.


Array Sterms(1:4, 1:TotalTerms), where TotalTerms is the sum of all the
entries in NumTerms(.,.). Each column of Sterms represents one term in
one polynomial: The first NumTerms(1,1) columns of Sterms represent
the terms in $f_{11}$, the second NumTerms(2,1) columns of Sterms represent
the terms in $f_{21}$, and so on. The term represented by Sterms(1:4,k) is
Sterms(4,k) $\cdot\, x_1^{\mathrm{Sterms}(1,k)} \cdot x_2^{\mathrm{Sterms}(2,k)} \cdot x_3^{\mathrm{Sterms}(3,k)}$.


Array tC(1:3) contains the degrees of polynomials $g_1$, $g_2$, and $g_3$ in that
order.


Array Curve(1:tC(1)+tC(2)+tC(3)+3) contains the coefficients of the
polynomials $g_1$, $g_2$, and $g_3$, one polynomial after the other, from the
constant term to the highest order coefficient of each.


Your program should also compute error bounds for the computed answers. 
This will be possible only when the eigenproblem can be reduced to one for 
which the error bounds in Theorems 4.4 or 4.5 apply. You do not have to 
provide error bounds when the eigenproblem is a more general one. (For a
description of error bounds for more general eigenproblems, see [10, 235].)

Write a second Matlab program that plots S and C for the case n = 3 and 
marks the intersection points. 

Are there any limitations on the input data for your codes to work? What
happens if S and C do not intersect? What happens if C lies in S?

Run your codes on at least the following examples. You should be able to 
solve the first five by hand to check your code. 


1. 


| 
|. 


g =t p =lH p=24 F= | TRT oe aa 
a a ia | a 3ar1 + ma 7x3 + 10 
EA E | aa E ETT Tag +10 
a =? m=140,m=242,F=| 4 ne eee | 

f -fgs gard ro | EOTS ee | 


+ £2 + T3 Tı 

= 72 — t2 =? t? F= a 

g t4, go 1+t*, 93 +t,” T3 3£1 + 5£2 — 7x3 + 10 
g=7—3t+t, pal PEE pair et, 


|.
1x2 + 23 3-23 5 + £1 + £2 + £3 + £1£2 + 2173+ FQX3
F= | 22-723 1 — x? +2273 3 + £1 + 323 — 9T2T3 
2 321 + 5L2 — 7273 + 8 x? — rå + 423 


You should turn in

• mathematical formulation of the solution in terms of an eigenproblem.

• the algorithm in at most two pages, including a road map to your code
(subroutine names for each high-level operation). It should be easy to see how
the mathematical formulation leads to the algorithm and how the algorithm
matches the code.

– At most how many discrete solutions can there be?

– Do all computed eigenvalues represent actual intersections? Which ones
do?

– What limits does your code place on the input for it to work correctly?

– What happens if S and C do not intersect?

– What happens if S contains C?

• mathematical formulation of the error bounds.

• the algorithm for computing the error bounds in at most two pages,
including a road map to your code (subroutine names for each high-level
operation). It should be easy to see how the mathematical formulation leads to
the algorithm and how the algorithm matches the code.

• program listing.

For each of the seven examples, you should turn in

• the original statement of the problem.

• the resulting eigenproblem.

• the numerical solutions.

• plots of S and C; do your numerical solutions match the plots?

• the result of substituting the computed answers in the equations defining
S and C: are they satisfied (to within roundoff)?


5


The Symmetric Eigenproblem and 
Singular Value Decomposition 


5.1. Introduction 


We discuss perturbation theory (in section 5.2), algorithms (in sections 5.3 
and 5.4), and applications (in section 5.5 and elsewhere) of the symmetric 
eigenvalue problem. We also discuss its close relative, the SVD. Since the 


T 
eigendecomposition of the symmetric matrix H = [ i r ] and the SVD of A 


are very simply related (see Theorem 3.3), most of the perturbation theorems 
and algorithms for the symmetric eigenproblem extend to the SVD. 

As discussed at the beginning of Chapter 4, one can roughly divide the 
algorithms for the symmetric eigenproblem (and SVD) into two groups: direct 
methods and iterative methods. This chapter considers only direct methods, 
which are intended to compute all (or a selected subset) of the eigenvalues 
and (optionally) eigenvectors, costing $O(n^3)$ operations for dense matrices. 
Iterative methods are discussed in Chapter 7. 

Since there has been a great deal of recent progress in algorithms and 
applications of symmetric eigenproblems, we will highlight three examples: 


• A high-speed algorithm for the symmetric eigenproblem based on 
divide-and-conquer is discussed in section 5.3.3. This is the fastest available 
algorithm for finding all eigenvalues and all eigenvectors of a large dense 
or banded symmetric matrix (or the SVD of a general matrix). It is 
significantly faster than the previous “workhorse” algorithm, QR 
iteration.^16 

^16 There is yet more recent work [199, 201] on an algorithm based on inverse 
iteration (Algorithm 4.2), which may provide a still faster and more accurate 
algorithm. But as of September 1996 the theory and software were still under 
development. 

• High-accuracy algorithms based on the dqds and Jacobi algorithms are 
discussed in sections 5.2.1, 5.4.2, and 5.4.3. These algorithms can find 
tiny eigenvalues (or singular values) more accurately than alternative 
algorithms like divide-and-conquer, although sometimes more slowly. 


• Section 5.5 discusses a “nonlinear” vibrating system, described by a 
differential equation called the Toda flow. Its continuous solution is closely 
related to the intermediate steps of the QR algorithm for the symmetric 
eigenproblem. 


Following Chapter 4, we will continue to use a vibrating mass-spring system 
as a running example to illustrate features of the symmetric eigenproblem. 


EXAMPLE 5.1. Symmetric eigenvalue problems often arise in analyzing 
mechanical vibrations. Example 4.1 presented one such example in detail; we will 
use notation from that example, so the reader is advised to review it now. To 
make the problem in Example 4.1 symmetric, we need to assume that there is 
no damping, so the differential equations of motion of the mass-spring system 
become $M\ddot{x}(t) = -Kx(t)$, where $M = \mathrm{diag}(m_1,\ldots,m_n)$ and 

\[
K = \begin{bmatrix}
k_1+k_2 & -k_2 & & & \\
-k_2 & k_2+k_3 & -k_3 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -k_{n-1} & k_{n-1}+k_n & -k_n \\
 & & & -k_n & k_n
\end{bmatrix}.
\]

Since $M$ is nonsingular, we can rewrite this as $\ddot{x}(t) = -M^{-1}Kx(t)$. 
If we seek solutions of the form $x(t) = e^{\gamma t}x(0)$, then we get 
$e^{\gamma t}\gamma^2 x(0) = -M^{-1}Ke^{\gamma t}x(0)$, or 
$M^{-1}Kx(0) = -\gamma^2 x(0)$. In other words, $-\gamma^2$ is an eigenvalue 
and $x(0)$ is an eigenvector of $M^{-1}K$. Now $M^{-1}K$ is not generally 
symmetric, but we can make it symmetric as follows. Define 
$M^{1/2} = \mathrm{diag}(m_1^{1/2},\ldots,m_n^{1/2})$ and multiply 
$M^{-1}Kx(0) = -\gamma^2 x(0)$ by $M^{1/2}$ on both sides to get 

\[
M^{-1/2}Kx(0) = M^{-1/2}K(M^{-1/2}M^{1/2})x(0) = -\gamma^2 M^{1/2}x(0)
\]

or $\hat{K}\hat{x} = -\gamma^2\hat{x}$, where $\hat{x} = M^{1/2}x(0)$ and 
$\hat{K} = M^{-1/2}KM^{-1/2}$. It is easy to see that 

\[
\hat{K} = \begin{bmatrix}
\frac{k_1+k_2}{m_1} & \frac{-k_2}{\sqrt{m_1 m_2}} & & & \\
\frac{-k_2}{\sqrt{m_1 m_2}} & \frac{k_2+k_3}{m_2} & \frac{-k_3}{\sqrt{m_2 m_3}} & & \\
 & \ddots & \ddots & \ddots & \\
 & & \frac{-k_{n-1}}{\sqrt{m_{n-2} m_{n-1}}} & \frac{k_{n-1}+k_n}{m_{n-1}} & \frac{-k_n}{\sqrt{m_{n-1} m_n}} \\
 & & & \frac{-k_n}{\sqrt{m_{n-1} m_n}} & \frac{k_n}{m_n}
\end{bmatrix}
\]

is symmetric. Thus each eigenvalue $-\gamma^2$ of $\hat{K}$ is real, and each 
eigenvector $\hat{x} = M^{1/2}x(0)$ of $\hat{K}$ is orthogonal to the others. 

In fact, $\hat{K}$ is a tridiagonal matrix, a special form to which any symmetric 
matrix can be reduced, using Algorithm 4.6, specialized to symmetric matrices 
as described in section 4.4.7. Most of the algorithms in section 5.3 for finding 
the eigenvalues and eigenvectors of a symmetric matrix assume that the matrix 
has initially been reduced to tridiagonal form. 

There is another way to express the solution to this mechanical vibration 
problem, using the SVD. Define $K_D = \mathrm{diag}(k_1,\ldots,k_n)$ and 
$K_D^{1/2} = \mathrm{diag}(k_1^{1/2},\ldots,k_n^{1/2})$. Then $K$ can be 
factored as $K = BK_DB^T$, where 

\[
B = \begin{bmatrix}
1 & -1 & & \\
 & \ddots & \ddots & \\
 & & 1 & -1 \\
 & & & 1
\end{bmatrix},
\]

as can be confirmed by a small calculation. Thus 

\[
\begin{aligned}
\hat{K} &= M^{-1/2}KM^{-1/2} \\
 &= M^{-1/2}BK_DB^TM^{-1/2} \\
 &= (M^{-1/2}BK_D^{1/2}) \cdot (K_D^{1/2}B^TM^{-1/2}) \\
 &= (M^{-1/2}BK_D^{1/2}) \cdot (M^{-1/2}BK_D^{1/2})^T \\
 &\equiv GG^T. \qquad (5.1)
\end{aligned}
\]

Therefore the singular values of $G = M^{-1/2}BK_D^{1/2}$ are the square roots 
of the eigenvalues of $\hat{K}$, and the left singular vectors of $G$ are the 
eigenvectors of $\hat{K}$, as shown in Theorem 3.3. Note that $G$ is nonzero 
only on the main diagonal and on the first superdiagonal. Such matrices are 
called bidiagonal, and most algorithms for the SVD begin by reducing the matrix 
to bidiagonal form, using the algorithm in section 4.4.7. 

Note that the factorization $\hat{K} = GG^T$ implies that $\hat{K}$ is positive 
definite, since $G$ is nonsingular. Therefore the eigenvalues $-\gamma^2$ of 
$\hat{K}$ are all positive. Thus $\gamma$ is pure imaginary, and the solutions 
of the original differential equation $x(t) = e^{\gamma t}x(0)$ are oscillatory 
with frequency $|\gamma|$. 

For a Matlab solution of a vibrating mass-spring system, see 
HOMEPAGE/Matlab/massspring.m. For a Matlab animation of the vibrations 
of a similar physical system, see demo/continue/fun-extras/miscellaneous/ 
bending. ◇ 
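The factorization in (5.1) can be checked numerically on a small system. The following is a minimal sketch in plain Python, not the Matlab code referenced above; the function names `mass_spring_matrices` and `gram` are hypothetical, and the masses and spring constants are the ones that appear later in Example 5.6.

```python
import math

def mass_spring_matrices(m, k):
    """Return (K_hat, G) for masses m[0..n-1] and spring constants k[0..n-1]."""
    n = len(m)
    # B has 1 on the diagonal and -1 on the first superdiagonal.
    B = [[1.0 if j == i else -1.0 if j == i + 1 else 0.0 for j in range(n)]
         for i in range(n)]
    # G = M^{-1/2} B K_D^{1/2}: scale row i by m_i^{-1/2}, column j by k_j^{1/2}.
    G = [[B[i][j] * math.sqrt(k[j]) / math.sqrt(m[i]) for j in range(n)]
         for i in range(n)]
    # K = B K_D B^T, then K_hat = M^{-1/2} K M^{-1/2}.
    K = [[sum(B[i][l] * k[l] * B[j][l] for l in range(n)) for j in range(n)]
         for i in range(n)]
    K_hat = [[K[i][j] / math.sqrt(m[i] * m[j]) for j in range(n)]
             for i in range(n)]
    return K_hat, G

def gram(G):
    """Return G G^T."""
    n = len(G)
    return [[sum(G[i][l] * G[j][l] for l in range(n)) for j in range(n)]
            for i in range(n)]

m = [1.0, 100.0, 10000.0]
k = [10000.0, 100.0, 1.0]
K_hat, G = mass_spring_matrices(m, k)
GGt = gram(G)
# Largest entrywise difference between K_hat and G G^T (roundoff only).
max_diff = max(abs(K_hat[i][j] - GGt[i][j]) for i in range(3) for j in range(3))
print(K_hat)
print(max_diff)
```

Building $\hat{K}$ and $GG^T$ by two different routes and comparing them is only a consistency check on the algebra; it does not compute any eigenvalues or singular values.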


5.2. Perturbation Theory 


Suppose that $A$ is symmetric, with eigenvalues $\alpha_1 \ge \cdots \ge \alpha_n$ 
and corresponding unit eigenvectors $q_1,\ldots,q_n$. Suppose $E$ is also 
symmetric, and let $\hat{A} = A + E$ have perturbed eigenvalues 
$\hat\alpha_1 \ge \cdots \ge \hat\alpha_n$ and corresponding perturbed 
eigenvectors $\hat{q}_1,\ldots,\hat{q}_n$. The major goal of this section is to 
bound the differences between the eigenvalues $\alpha_i$ and $\hat\alpha_i$, and 
between the eigenvectors $q_i$ and $\hat{q}_i$, in terms of the “size” of $E$. 
Most of our bounds will use $\|E\|_2$ as the size of $E$, except for 
section 5.2.1, which discusses “relative” perturbation theory. 

We already derived our first perturbation bound for eigenvalues in 
Chapter 4, where we proved Corollary 4.1: Let $A$ be symmetric with eigenvalues 
$\alpha_1 \ge \cdots \ge \alpha_n$. Let $A + E$ be symmetric with eigenvalues 
$\hat\alpha_1 \ge \cdots \ge \hat\alpha_n$. If $\alpha_i$ is simple, then 
$|\alpha_i - \hat\alpha_i| \le \|E\|_2 + O(\|E\|_2^2)$. 

This result is weak because it assumes $\alpha_i$ has multiplicity one, and it 
is useful only for sufficiently small $\|E\|_2$. The next theorem eliminates 
both weaknesses. 


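THEOREM 5.1. Weyl. Let $A$ and $E$ be n-by-n symmetric matrices. Let 
$\alpha_1 \ge \cdots \ge \alpha_n$ be the eigenvalues of $A$ and 
$\hat\alpha_1 \ge \cdots \ge \hat\alpha_n$ be the eigenvalues of 
$\hat{A} = A + E$. Then $|\alpha_i - \hat\alpha_i| \le \|E\|_2$. 


COROLLARY 5.1. Let $G$ and $F$ be arbitrary matrices (of the same size) where 
$\sigma_1 \ge \cdots \ge \sigma_n$ are the singular values of $G$ and 
$\sigma'_1 \ge \cdots \ge \sigma'_n$ are the singular values of $G + F$. Then 
$|\sigma_i - \sigma'_i| \le \|F\|_2$. 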


We can use Weyl’s theorem to get error bounds for the eigenvalues computed 
by any backward stable algorithm, such as QR iteration: Such an algorithm 
computes eigenvalues $\hat\alpha_i$ that are the exact eigenvalues of 
$\hat{A} = A + E$, where $\|E\|_2 = O(\epsilon)\|A\|_2$. Therefore, their errors 
can be bounded by 
$|\alpha_i - \hat\alpha_i| \le \|E\|_2 = O(\epsilon)\|A\|_2 = O(\epsilon)\max_j|\alpha_j|$. 
This is a very satisfactory error bound, especially for large eigenvalues 
(those $\alpha_i$ near $\|A\|_2$ in magnitude), since they will be computed 
with most of their digits correct. Small eigenvalues ($|\alpha_i| \ll \|A\|_2$) 
may have fewer correct digits (but see section 5.2.1). 
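Weyl's bound is easy to observe on a 2-by-2 example, where eigenvalues have a closed form. The sketch below is only an illustration: the helper names `eig_sym2` and `norm2_sym2` are hypothetical, a symmetric matrix $[a, b; b, c]$ is stored as the tuple `(a, b, c)`, and the entries of `A` and `E` are arbitrary values chosen for the demonstration.

```python
import math

def eig_sym2(a, b, c):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean + radius, mean - radius

def norm2_sym2(a, b, c):
    """2-norm of a symmetric 2-by-2 matrix: its largest eigenvalue in magnitude."""
    hi, lo = eig_sym2(a, b, c)
    return max(abs(hi), abs(lo))

# A and a symmetric perturbation E (arbitrary illustrative values).
A = (2.0, 1.0, -1.0)
E = (0.3, -0.2, 0.1)
A_pert = tuple(x + y for x, y in zip(A, E))

alpha = eig_sym2(*A)              # alpha_1 >= alpha_2
alpha_hat = eig_sym2(*A_pert)     # perturbed eigenvalues, same ordering
bound = norm2_sym2(*E)            # ||E||_2
shifts = [abs(x - y) for x, y in zip(alpha, alpha_hat)]
print(shifts, bound)              # each shift should not exceed the bound
```

Note that the eigenvalues are matched by their position in the sorted order, exactly as in the statement of the theorem.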

We will prove Weyl’s theorem using another useful classical result: the 
Courant—Fischer minimax theorem. To state this theorem we need to intro- 
duce the Rayleigh quotient, which will also play an important role in several 
algorithms, such as Algorithm 5.1. 


DEFINITION 5.1. The Rayleigh quotient of a symmetric matrix $A$ and nonzero 
vector $u$ is $\rho(u, A) \equiv (u^TAu)/(u^Tu)$. 


Here are some simple but important properties of $\rho(u, A)$. First, 
$\rho(\gamma u, A) = \rho(u, A)$ for any nonzero scalar $\gamma$. Second, if 
$Aq_i = \alpha_i q_i$, then $\rho(q_i, A) = \alpha_i$. More generally, suppose 
$Q^TAQ = \Lambda = \mathrm{diag}(\alpha_i)$ is the eigendecomposition of $A$, 
with $Q = [q_1,\ldots,q_n]$. Expand $u$ in the basis of eigenvectors $q_i$ as 
follows: $u = Q(Q^Tu) \equiv Q\xi = \sum_i q_i\xi_i$. Then we can write 

\[
\rho(u, A) = \frac{\xi^TQ^TAQ\xi}{\xi^TQ^TQ\xi}
           = \frac{\xi^T\Lambda\xi}{\xi^T\xi}
           = \frac{\sum_i \alpha_i\xi_i^2}{\sum_i \xi_i^2}.
\]

In other words, $\rho(u, A)$ is a weighted average of the eigenvalues of $A$. 
Its largest value, $\max_{u \ne 0}\rho(u, A)$, occurs for $u = q_1$ 
($\xi = e_1$) and equals $\rho(q_1, A) = \alpha_1$. Its smallest value, 
$\min_{u \ne 0}\rho(u, A)$, occurs for $u = q_n$ ($\xi = e_n$) and equals 
$\rho(q_n, A) = \alpha_n$. Together, these facts imply 

\[
\max_{u \ne 0}|\rho(u, A)| = \max(|\alpha_1|, |\alpha_n|) = \|A\|_2. \qquad (5.2)
\]

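The properties just listed can be tried out directly. The following short sketch uses plain Python lists; the function name `rayleigh_quotient` is an assumption for illustration, and the test matrix diag(1, .25, 0) is the one used in Example 5.3.

```python
def rayleigh_quotient(A, u):
    """rho(u, A) = (u^T A u) / (u^T u) for a square matrix A given as nested lists."""
    n = len(u)
    Au = [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
    return sum(u[i] * Au[i] for i in range(n)) / sum(x * x for x in u)

A = [[1.0, 0.0, 0.0],
     [0.0, 0.25, 0.0],
     [0.0, 0.0, 0.0]]          # diag(1, .25, 0)

u = [1.0, 2.0, -2.0]
# Weighted average of the eigenvalues: (1*1 + .25*4 + 0*4) / 9 = 2/9.
rho = rayleigh_quotient(A, u)
# Scaling u by a nonzero scalar leaves the quotient unchanged.
rho_scaled = rayleigh_quotient(A, [-3.0 * x for x in u])
# An eigenvector returns its own eigenvalue.
rho_q2 = rayleigh_quotient(A, [0.0, 1.0, 0.0])
print(rho, rho_scaled, rho_q2)
```

Because the quotient is a weighted average of the eigenvalues, `rho` necessarily lies between the smallest eigenvalue 0 and the largest eigenvalue 1.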
THEOREM 5.2. Courant-Fischer minimax theorem. Let 
$\alpha_1 \ge \cdots \ge \alpha_n$ be eigenvalues of the symmetric matrix $A$ 
and $q_1,\ldots,q_n$ be the corresponding unit eigenvectors. Then 

\[
\max_{\mathbf{R}^j}\ \min_{0 \ne r \in \mathbf{R}^j} \rho(r, A)
 = \alpha_j
 = \min_{\mathbf{S}^{n-j+1}}\ \max_{0 \ne s \in \mathbf{S}^{n-j+1}} \rho(s, A).
\]

The maximum in the first expression for $\alpha_j$ is over all j-dimensional 
subspaces $\mathbf{R}^j$ of $\mathbb{R}^n$, and the subsequent minimum is over 
all nonzero vectors $r$ in the subspace. The maximum is attained for 
$\mathbf{R}^j = \mathrm{span}(q_1, q_2, \ldots, q_j)$, and a minimizing $r$ is 
$r = q_j$. 

The minimum in the second expression for $\alpha_j$ is over all 
(n − j + 1)-dimensional subspaces $\mathbf{S}^{n-j+1}$ of $\mathbb{R}^n$, and 
the subsequent maximum is over all nonzero vectors $s$ in the subspace. The 
minimum is attained for 
$\mathbf{S}^{n-j+1} = \mathrm{span}(q_j, q_{j+1}, \ldots, q_n)$, and a 
maximizing $s$ is $s = q_j$. 


EXAMPLE 5.2. Let $j = 1$, so $\alpha_j$ is the largest eigenvalue. Given 
$\mathbf{R}^1$, $\rho(r, A)$ is the same for all nonzero $r \in \mathbf{R}^1$, 
since all such $r$ are scalar multiples of one another. Thus the first 
expression for $\alpha_1$ simplifies to $\alpha_1 = \max_{r \ne 0}\rho(r, A)$. 
Similarly, since $n - j + 1 = n$, the only subspace $\mathbf{S}^{n-j+1}$ is 
$\mathbb{R}^n$, the whole space. Then the second expression for $\alpha_1$ also 
simplifies to $\alpha_1 = \max_{s \ne 0}\rho(s, A)$. 

One can similarly show that the theorem simplifies to the following 
expression for the smallest eigenvalue: $\alpha_n = \min_{r \ne 0}\rho(r, A)$. ◇ 


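Proof of the Courant-Fischer minimax theorem. Choose any subspaces 
$\mathbf{R}^j$ and $\mathbf{S}^{n-j+1}$ of the indicated dimensions. Since the 
sum of their dimensions $j + (n - j + 1) = n + 1$ exceeds $n$, there must be a 
nonzero vector $x_{RS} \in \mathbf{R}^j \cap \mathbf{S}^{n-j+1}$. Thus 

\[
\min_{0 \ne r \in \mathbf{R}^j} \rho(r, A) \le \rho(x_{RS}, A)
 \le \max_{0 \ne s \in \mathbf{S}^{n-j+1}} \rho(s, A).
\]

Now choose $\hat{\mathbf{R}}^j$ to maximize the expression on the left, and 
choose $\hat{\mathbf{S}}^{n-j+1}$ to minimize the expression on the right. Then 

\[
\begin{aligned}
\max_{\mathbf{R}^j}\ \min_{0 \ne r \in \mathbf{R}^j} \rho(r, A)
 &= \min_{0 \ne r \in \hat{\mathbf{R}}^j} \rho(r, A) \qquad (5.3) \\
 &\le \rho(x_{\hat{R}\hat{S}}, A) \\
 &\le \max_{0 \ne s \in \hat{\mathbf{S}}^{n-j+1}} \rho(s, A) \\
 &= \min_{\mathbf{S}^{n-j+1}}\ \max_{0 \ne s \in \mathbf{S}^{n-j+1}} \rho(s, A).
\end{aligned}
\]

To see that all these inequalities are actually equalities, we exhibit 
particular $\mathbf{R}^j$ and $\mathbf{S}^{n-j+1}$ that make the lower bound 
equal the upper bound. First choose the subspace 
$\mathrm{span}(q_1, \ldots, q_j)$ as $\mathbf{R}^j$, so that 

\[
\begin{aligned}
\max_{\mathbf{R}^j}\ \min_{0 \ne r \in \mathbf{R}^j} \rho(r, A)
 &\ge \min_{0 \ne r \in \mathrm{span}(q_1,\ldots,q_j)} \rho(r, A) \\
 &= \min_{0 \ne r = \sum_{i \le j} \xi_i q_i} \rho(r, A) \\
 &= \min_{\text{some } \xi_i \ne 0}
    \frac{\sum_{i \le j} \xi_i^2 \alpha_i}{\sum_{i \le j} \xi_i^2} = \alpha_j.
\end{aligned}
\]

Next choose the subspace $\mathrm{span}(q_j, \ldots, q_n)$ as 
$\mathbf{S}^{n-j+1}$, so that 

\[
\begin{aligned}
\min_{\mathbf{S}^{n-j+1}}\ \max_{0 \ne s \in \mathbf{S}^{n-j+1}} \rho(s, A)
 &\le \max_{0 \ne s \in \mathrm{span}(q_j,\ldots,q_n)} \rho(s, A) \\
 &= \max_{0 \ne s = \sum_{i \ge j} \xi_i q_i} \rho(s, A) \\
 &= \max_{\text{some } \xi_i \ne 0}
    \frac{\sum_{i \ge j} \xi_i^2 \alpha_i}{\sum_{i \ge j} \xi_i^2} = \alpha_j.
\end{aligned}
\]

Thus, the lower and upper bounds are sandwiched between $\alpha_j$ below and 
$\alpha_j$ above, so they must all equal $\alpha_j$ as desired. 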


EXAMPLE 5.3. Figure 5.1 illustrates this theorem graphically for 3-by-3 
matrices. Since $\rho(u/\|u\|_2, A) = \rho(u, A)$, we can think of $\rho(u, A)$ 
as a function on the unit sphere $\|u\|_2 = 1$. Figure 5.1 shows a contour plot 
of this function on the unit sphere for $A = \mathrm{diag}(1, .25, 0)$. For this 
simple matrix $q_i = e_i$, the ith column of the identity matrix. The figure is 
symmetric about the origin since $\rho(u, A) = \rho(-u, A)$. The small red 
circles near $\pm q_1$ surround the global maximum $\rho(\pm q_1, A) = 1$, and 
the small green circles near $\pm q_3$ surround the global minimum 
$\rho(\pm q_3, A) = 0$. The two great circles are contours for 
$\rho(u, A) = .25$, the second eigenvalue. Within the two narrow (green) “apple 
slices” defined by the great circles, $\rho(u, A) < .25$, and within the wide 
(red) apple slices, $\rho(u, A) > .25$. 

Let us interpret the minimax theorem in terms of this figure. Choosing 
a space $\mathbf{R}^2$ is equivalent to choosing a great circle $C$; every 
point on $C$ lies within $\mathbf{R}^2$, and $\mathbf{R}^2$ consists of all 
scalar multiples of the vectors in $C$. Thus 
$\min_{0 \ne r \in \mathbf{R}^2}\rho(r, A) = \min_{r \in C}\rho(r, A)$. There 
are four cases to consider to compute $\min_{r \in C}\rho(r, A)$: 


1. $C$ does not go through the intersection points $\pm q_2$ of the two great 
circles in Figure 5.1. Then $C$ clearly must intersect both a narrow green 
apple slice (as well as a wide red apple slice), so 
$\min_{r \in C}\rho(r, A) < .25$. 


2. $C$ does go through the two intersection points $\pm q_2$ and otherwise lies 
in the narrow green apple slices. Then $\min_{r \in C}\rho(r, A) < .25$. 


3. $C$ does go through the two intersection points $\pm q_2$ and otherwise lies 
in the wide red apple slices. Then $\min_{r \in C}\rho(r, A) = .25$, attained 
for $r = \pm q_2$. 


4. $C$ coincides with one of the two great circles. Then $\rho(r, A) = .25$ for 
all $r \in C$. 


Fig. 5.1. Contour plot of the Rayleigh quotient on the unit sphere. 


The minimax theorem says that $\alpha_2 = .25$ is the maximum of 
$\min_{r \in C}\rho(r, A)$ over all choices of great circle $C$. This maximum is 
attained in cases 3 and 4 above. In particular, for $C$ bisecting the wide red 
apple slices (case 3), $\mathbf{R}^2 = \mathrm{span}(q_1, q_2)$. 

Software to draw contour plots like those in Figure 5.1 for an 
arbitrary 3-by-3 symmetric matrix may be found at 
HOMEPAGE/Matlab/RayleighContour.m. ◇ 


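Finally, we can present the proof of Weyl’s theorem. 

\[
\begin{aligned}
\hat\alpha_j
 &= \min_{\mathbf{S}^{n-j+1}}\ \max_{0 \ne u \in \mathbf{S}^{n-j+1}}
    \frac{u^T(A+E)u}{u^Tu}
    && \text{by the minimax theorem} \\
 &= \min_{\mathbf{S}^{n-j+1}}\ \max_{0 \ne u \in \mathbf{S}^{n-j+1}}
    \left( \frac{u^TAu}{u^Tu} + \frac{u^TEu}{u^Tu} \right) \\
 &\le \min_{\mathbf{S}^{n-j+1}}\ \max_{0 \ne u \in \mathbf{S}^{n-j+1}}
    \left( \frac{u^TAu}{u^Tu} + \|E\|_2 \right)
    && \text{by equation (5.2)} \\
 &= \alpha_j + \|E\|_2
    && \text{by the minimax theorem again.}
\end{aligned}
\]

Reversing the roles of $A$ and $A + E$, we also get 
$\alpha_j \le \hat\alpha_j + \|E\|_2$. Together, these two inequalities 
complete the proof of Weyl’s theorem. 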

A theorem closely related to the Courant-Fischer minimax theorem, one 
that we will need later to justify the Bisection algorithm in section 5.3.4, is 
Sylvester’s theorem of inertia. 

DEFINITION 5.2. The inertia of a symmetric matrix $A$ is the triple of integers 
$\mathrm{Inertia}(A) \equiv (\nu, \zeta, \pi)$, where $\nu$ is the number of 
negative eigenvalues of $A$, $\zeta$ is the number of zero eigenvalues of $A$, 
and $\pi$ is the number of positive eigenvalues of $A$. 


If $X$ is orthogonal, then $X^TAX$ and $A$ are similar and so have the same 
eigenvalues. When $X$ is only nonsingular, we say $X^TAX$ and $A$ are congruent. 
In this case $X^TAX$ will generally not have the same eigenvalues as $A$, but 
the next theorem tells us that the two sets of eigenvalues will at least have 
the same signs. 


THEOREM 5.3. (Sylvester’s Inertia Theorem.) Let $A$ be symmetric and 
$X$ be nonsingular. Then $A$ and $X^TAX$ have the same inertia. 


Proof. Let $n$ be the dimension of $A$. Now suppose that $A$ has $\nu$ negative 
eigenvalues but that $X^TAX$ has $\nu' < \nu$ negative eigenvalues; we will find 
a contradiction to prove that this cannot happen. Let $N$ be the corresponding 
$\nu$-dimensional negative eigenspace of $A$; i.e., $N$ is spanned by the 
eigenvectors of the $\nu$ negative eigenvalues of $A$. This means that for any 
nonzero $x \in N$, $x^TAx < 0$. Let $P$ be the $(n - \nu')$-dimensional 
nonnegative eigenspace of $X^TAX$; this means that for any nonzero $x \in P$, 
$x^TX^TAXx \ge 0$. Since $X$ is nonsingular, the space $XP$ is also 
$(n - \nu')$-dimensional. Since 
$\dim(N) + \dim(XP) = \nu + n - \nu' > n$, the spaces $N$ and $XP$ must contain 
a nonzero vector $x$ in their intersection. But then $0 > x^TAx$ since 
$x \in N$ and $0 \le x^TAx$ since $x \in XP$, which is a contradiction. 
Therefore $\nu' \ge \nu$; reversing the roles of $A$ and $X^TAX$ (with $X^{-1}$ 
in place of $X$) gives $\nu \ge \nu'$, so $\nu = \nu'$; i.e., $A$ and $X^TAX$ 
have the same number of negative eigenvalues. An analogous argument shows 
they have the same number of positive eigenvalues. Thus, they must also have 
the same number of zero eigenvalues. 
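One practical consequence of the theorem is that the inertia can be read off from any congruent diagonal matrix. The sketch below is a minimal plain-Python illustration under a stated assumption: symmetric Gaussian elimination without pivoting is used to write $A = LDL^T$, which only works when no zero pivot appears before the last step. The function names `inertia_from_pivots` and `congruence` and the matrices `A` and `X` are hypothetical choices made for this illustration.

```python
def inertia_from_pivots(A):
    """Inertia (negative, zero, positive) of symmetric A via elimination pivots.

    Symmetric Gaussian elimination without pivoting gives A = L D L^T with L
    unit lower triangular, so A is congruent to D and shares its inertia.
    Assumption of this sketch: no zero pivot appears before the last step.
    """
    n = len(A)
    S = [row[:] for row in A]          # working copy (Schur complements)
    pivots = []
    for k in range(n):
        d = S[k][k]
        pivots.append(d)
        if d == 0.0 and k < n - 1:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        for i in range(k + 1, n):
            factor = S[i][k] / d
            for j in range(k + 1, n):
                S[i][j] -= factor * S[k][j]
    neg = sum(1 for d in pivots if d < 0)
    zero = sum(1 for d in pivots if d == 0)
    pos = sum(1 for d in pivots if d > 0)
    return neg, zero, pos

def congruence(A, X):
    """Return X^T A X for square matrices stored as nested lists."""
    n = len(A)
    AX = [[sum(A[i][l] * X[l][j] for l in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(X[l][i] * AX[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

# tridiag(-1, 2, -1) has eigenvalues 2 - sqrt(2), 2, 2 + sqrt(2); shift by 1.5.
A = [[0.5, -1.0, 0.0],
     [-1.0, 0.5, -1.0],
     [0.0, -1.0, 0.5]]
X = [[2.0, 1.0, 0.0],
     [1.0, 1.0, 1.0],
     [0.0, 1.0, 3.0]]          # nonsingular: det(X) = 1

inertia_A = inertia_from_pivots(A)
inertia_XAX = inertia_from_pivots(congruence(A, X))
print(inertia_A, inertia_XAX)
```

Both calls count one negative and two positive pivots, which agrees with the fact that exactly one eigenvalue of the unshifted matrix, $2 - \sqrt{2}$, lies below the shift 1.5.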

Now we consider how eigenvectors can change under perturbations $A + E$ 
of $A$. To state our bound we need to define the gap in the spectrum. 


DEFINITION 5.3. Let $A$ have eigenvalues $\alpha_1 \ge \cdots \ge \alpha_n$. 
Then the gap between an eigenvalue $\alpha_i$ and the rest of the spectrum is 
defined to be $\mathrm{gap}(i, A) = \min_{j \ne i}|\alpha_j - \alpha_i|$. We 
will also write gap(i) if $A$ is understood from the context. 


The basic result is that the sensitivity of an eigenvector depends on the gap 
of its corresponding eigenvalue: a small gap implies a sensitive eigenvector. 


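EXAMPLE 5.4. Let $A = \begin{bmatrix} 1+g & 0 \\ 0 & 1 \end{bmatrix}$ and 
$A + E = \begin{bmatrix} 1+g & \epsilon \\ \epsilon & 1 \end{bmatrix}$, where 
$0 < \epsilon < g$. Thus $\mathrm{gap}(i, A) = g \approx \mathrm{gap}(i, A+E)$ 
for $i = 1, 2$. The eigenvectors of $A$ are just $q_1 = e_1$ and $q_2 = e_2$. 
A small computation reveals that the eigenvectors of $A + E$ are 

\[
\hat{q}_1 = \beta \begin{bmatrix} 1 + \sqrt{1 + (2\epsilon/g)^2} \\ 2\epsilon/g \end{bmatrix}
 \approx \begin{bmatrix} 1 \\ \epsilon/g \end{bmatrix},
\qquad
\hat{q}_2 = \beta \begin{bmatrix} -2\epsilon/g \\ 1 + \sqrt{1 + (2\epsilon/g)^2} \end{bmatrix}
 \approx \begin{bmatrix} -\epsilon/g \\ 1 \end{bmatrix},
\]

where $\beta \approx 1/2$ is a normalization factor. We see that the angle 
between the perturbed vectors $\hat{q}_i$ and unperturbed vectors $q_i$ equals 
$\epsilon/g$ to first order in $\epsilon$. So the angle is proportional to the 
reciprocal of the gap $g$. ◇ 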
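The dependence of the angle on the gap can be tabulated numerically. This is a small sketch, assuming the eigenvector of the perturbed 2-by-2 matrix for its larger eigenvalue is proportional to $[1 + \sqrt{1 + (2\epsilon/g)^2},\ 2\epsilon/g]^T$; the function name `perturbed_angle` and the sample values of $g$ and $\epsilon$ are chosen only for illustration.

```python
import math

def perturbed_angle(g, eps):
    """Acute angle between e_1 and the top eigenvector of [[1+g, eps], [eps, 1]]."""
    s = math.sqrt(1.0 + (2.0 * eps / g) ** 2)
    # Unnormalized eigenvector for the larger eigenvalue: [1 + s, 2*eps/g].
    return math.atan2(2.0 * eps / g, 1.0 + s)

eps = 1e-6
angles = {}
for g in (1.0, 1e-2, 1e-4):
    angles[g] = perturbed_angle(g, eps)
    # Columns: gap, angle, first-order prediction eps/g.
    print(g, angles[g], eps / g)
```

Shrinking the gap by a factor of 100 enlarges the angle by about the same factor, as long as $\epsilon$ stays well below $g$.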


The general case is essentially the same as the 2-by-2 case just analyzed. 


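THEOREM 5.4. Let $A = Q\Lambda Q^T = Q\,\mathrm{diag}(\alpha_i)\,Q^T$ be an 
eigendecomposition of $A$. Let 
$A + E = \hat{A} = \hat{Q}\hat\Lambda\hat{Q}^T$ be the perturbed 
eigendecomposition. Write $Q = [q_1, \ldots, q_n]$ and 
$\hat{Q} = [\hat{q}_1, \ldots, \hat{q}_n]$, where $q_i$ and $\hat{q}_i$ are the 
unperturbed and perturbed unit eigenvectors, respectively. Let $\theta$ denote 
the acute angle between $q_i$ and $\hat{q}_i$. Then 

\[
\frac{1}{2}\sin 2\theta \le \frac{\|E\|_2}{\mathrm{gap}(i, A)}
\quad \text{provided that } \mathrm{gap}(i, A) > 0.
\]

Similarly 

\[
\frac{1}{2}\sin 2\theta \le \frac{\|E\|_2}{\mathrm{gap}(i, A+E)}
\quad \text{provided that } \mathrm{gap}(i, A+E) > 0.
\]

Note that when $\theta \ll 1$, then 
$\frac{1}{2}\sin 2\theta \approx \sin\theta \approx \theta$. 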


The attraction of stating the bound in terms of $\mathrm{gap}(i, A+E)$, as well 
as $\mathrm{gap}(i, A)$, is that frequently we know only the eigenvalues of 
$A + E$, since they are typically the output of the eigenvalue algorithm that 
we have used. In this case it is straightforward to evaluate 
$\mathrm{gap}(i, A+E)$, whereas we can only estimate $\mathrm{gap}(i, A)$. 

When the first upper bound exceeds 1/2, i.e., 
$\|E\|_2 \ge \mathrm{gap}(i, A)/2$, the bound reduces to $\sin 2\theta \le 1$, 
which provides no information about $\theta$. Here is why we cannot bound 
$\theta$ in this situation: If $E$ is this large, then $A + E$’s eigenvalue 
$\hat\alpha_i$ could be sufficiently far from $\alpha_i$ for $A + E$ to have a 
multiple eigenvalue at $\hat\alpha_i$. For example, consider 
$A = \mathrm{diag}(2, 0)$ and $A + E = I$. But such an $A + E$ does not have a 
unique eigenvector $\hat{q}_i$; indeed, $A + E = I$ has any vector as an 
eigenvector. Thus, it makes no sense to try to bound $\theta$. The same 
considerations apply when the second upper bound exceeds 1/2. 

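Proof. It suffices to prove the first upper bound, because the second one 
follows by considering $A + E$ as the unperturbed matrix and 
$A = (A + E) - E$ as the perturbed matrix. 

Let $q_i + d$ be an eigenvector of $A + E$. To make $d$ unique, we impose the 
restriction that it be orthogonal to $q_i$ (written $d \perp q_i$) as shown 
below. Note that this means that $q_i + d$ is not a unit vector, so 
$\hat{q}_i = (q_i + d)/\|q_i + d\|_2$. Then $\tan\theta = \|d\|_2$ and 
$\sec\theta = \|q_i + d\|_2$. 

[Figure: right triangle with legs $q_i$ and $d$, hypotenuse $q_i + d$, and 
angle $\theta$ between $q_i$ and $q_i + d$.] 

Now write the ith column of $(A + E)\hat{Q} = \hat{Q}\hat\Lambda$ as 

\[
(A + E)(q_i + d) = \hat\alpha_i(q_i + d), \qquad (5.4)
\]

where we have also multiplied each side by $\|q_i + d\|_2$. Define 
$\eta = \hat\alpha_i - \alpha_i$. Subtract $Aq_i = \alpha_i q_i$ from both 
sides of (5.4) and rearrange to get 

\[
(A - \alpha_i I)d = (\eta I - E)(q_i + d). \qquad (5.5)
\]

Since $q_i^T(A - \alpha_i I) = 0$, both sides of (5.5) are orthogonal to $q_i$. 
This lets us write 
$z \equiv (\eta I - E)(q_i + d) = \sum_{j \ne i}\zeta_j q_j$ and 
$d \equiv \sum_{j \ne i}\delta_j q_j$. Since 
$(A - \alpha_i I)q_j = (\alpha_j - \alpha_i)q_j$, we can write 

\[
(A - \alpha_i I)d = \sum_{j \ne i}(\alpha_j - \alpha_i)\delta_j q_j
 = \sum_{j \ne i}\zeta_j q_j = (\eta I - E)(q_i + d)
\]

or 

\[
d = \sum_{j \ne i}\delta_j q_j = \sum_{j \ne i}\frac{\zeta_j}{\alpha_j - \alpha_i}\,q_j.
\]

Thus 

\[
\begin{aligned}
\tan\theta = \|d\|_2
 &= \left\| \sum_{j \ne i}\frac{\zeta_j}{\alpha_j - \alpha_i}\,q_j \right\|_2 \\
 &= \left( \sum_{j \ne i}\left(\frac{\zeta_j}{\alpha_j - \alpha_i}\right)^2 \right)^{1/2}
   && \text{since the } q_j \text{ are orthonormal} \\
 &\le \frac{1}{\mathrm{gap}(i, A)}\left( \sum_{j \ne i}\zeta_j^2 \right)^{1/2}
   && \text{since } \mathrm{gap}(i, A) \text{ is the smallest denominator} \\
 &= \frac{\|z\|_2}{\mathrm{gap}(i, A)}.
\end{aligned}
\]

If we were to use Weyl’s theorem and the triangle inequality to bound 
$\|z\|_2 \le (\|E\|_2 + |\eta|) \cdot \|q_i + d\|_2 \le 2\|E\|_2\sec\theta$, 
then we could conclude that $\sin\theta \le 2\|E\|_2/\mathrm{gap}(i, A)$. 

But we can do a little better than this by bounding 
$\|z\|_2 = \|(\eta I - E)(q_i + d)\|_2$ more carefully: Multiply (5.4) by 
$q_i^T$ on both sides, cancel terms, and rearrange to get 
$\eta = q_i^TE(q_i + d)$. Thus 

\[
\begin{aligned}
z &= (q_i + d)\eta - E(q_i + d) = (q_i + d)q_i^TE(q_i + d) - E(q_i + d) \\
  &= ((q_i + d)q_i^T - I)E(q_i + d),
\end{aligned}
\]

and so 
$\|z\|_2 \le \|(q_i + d)q_i^T - I\|_2 \cdot \|E\|_2 \cdot \|q_i + d\|_2$. We 
claim that $\|(q_i + d)q_i^T - I\|_2 = \|q_i + d\|_2$ (see Question 5.7). Thus 
$\|z\|_2 \le \|q_i + d\|_2^2 \cdot \|E\|_2$, so 

\[
\tan\theta \le \frac{\|z\|_2}{\mathrm{gap}(i, A)}
 \le \frac{\|q_i + d\|_2^2\,\|E\|_2}{\mathrm{gap}(i, A)}
 = \frac{\sec^2\theta \cdot \|E\|_2}{\mathrm{gap}(i, A)}
\]

or 

\[
\frac{\|E\|_2}{\mathrm{gap}(i, A)} \ge \frac{\tan\theta}{\sec^2\theta}
 = \sin\theta\cos\theta = \frac{1}{2}\sin 2\theta
\]

as desired. 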


An analogous theorem can be proven for singular vectors (see Question 5.8). 

The Rayleigh quotient has other nice properties. The next theorem tells us 
that the Rayleigh quotient is a “best approximation” to an eigenvalue in a nat- 
ural sense. This is the basis of the Rayleigh quotient iteration in section 5.3.2 
and the iterative algorithms in Chapter 7. It may also be used to evaluate the 
accuracy of an approximate eigenpair obtained in any way at all, not just by 
the algorithms discussed here. 


THEOREM 5.5. Let $A$ be symmetric, $x$ be a unit vector, and $\beta$ be a 
scalar. Then $A$ has an eigenpair $Aq_i = \alpha_i q_i$ satisfying 
$|\alpha_i - \beta| \le \|Ax - \beta x\|_2$. Given $x$, the choice 
$\beta = \rho(x, A)$ minimizes $\|Ax - \beta x\|_2$. 

With a little more information about the spectrum of $A$, we can get tighter 
bounds. Let $r = Ax - \rho(x, A)x$. Let $\alpha_i$ be the eigenvalue of $A$ 
closest to $\rho(x, A)$. Let 
$\mathrm{gap}' \equiv \min_{j \ne i}|\alpha_j - \rho(x, A)|$; this is a 
variation on the gap defined earlier. Let $\theta$ be the acute angle between 
$x$ and $q_i$. Then 

\[
\sin\theta \le \frac{\|r\|_2}{\mathrm{gap}'} \qquad (5.6)
\]

and 

\[
|\alpha_i - \rho(x, A)| \le \frac{\|r\|_2^2}{\mathrm{gap}'}. \qquad (5.7)
\]

See Theorem 7.1 for a generalization of this result to a set of eigenvalues. 
Notice that in equation (5.7) the difference between the Rayleigh quotient 
$\rho(x, A)$ and an eigenvalue $\alpha_i$ is proportional to the square of the 
residual norm $\|r\|_2$. This high accuracy is the basis of the cubic 
convergence of the Rayleigh quotient iteration algorithm of section 5.3.2. 

Proof. We prove only the first result and leave the others for questions 5.9 
and 5.10 at the end of the chapter. 

If $\beta$ is an eigenvalue of $A$, the result is immediate. So assume instead 
that $A - \beta I$ is nonsingular. Then 
$x = (A - \beta I)^{-1}(A - \beta I)x$ and 

\[
1 = \|x\|_2 \le \|(A - \beta I)^{-1}\|_2 \cdot \|(A - \beta I)x\|_2.
\]

Writing $A$’s eigendecomposition as 
$A = Q\Lambda Q^T = Q\,\mathrm{diag}(\alpha_1, \ldots, \alpha_n)\,Q^T$, we get 

\[
\|(A - \beta I)^{-1}\|_2 = \|Q(\Lambda - \beta I)^{-1}Q^T\|_2
 = \|(\Lambda - \beta I)^{-1}\|_2 = 1/\min_i|\alpha_i - \beta|,
\]

so $\min_i|\alpha_i - \beta| \le \|(A - \beta I)x\|_2$ as desired. 

To show that $\beta = \rho(x, A)$ minimizes $\|Ax - \beta x\|_2$ we will show 
that $x$ is orthogonal to $Ax - \rho(x, A)x$ so that applying the Pythagorean 
theorem to the sum of orthogonal vectors 

\[
Ax - \beta x = [Ax - \rho(x, A)x] + [(\rho(x, A) - \beta)x]
\]

yields 

\[
\begin{aligned}
\|Ax - \beta x\|_2^2 &= \|Ax - \rho(x, A)x\|_2^2 + \|(\rho(x, A) - \beta)x\|_2^2 \\
 &\ge \|Ax - \rho(x, A)x\|_2^2
\end{aligned}
\]

with equality only when $\beta = \rho(x, A)$. 
To confirm orthogonality of $x$ and $Ax - \rho(x, A)x$ we need to verify that 

\[
x^T(Ax - \rho(x, A)x) = x^T\left(Ax - \frac{x^TAx}{x^Tx}\,x\right)
 = x^TAx - x^TAx\,\frac{x^Tx}{x^Tx} = 0
\]

as desired. 


EXAMPLE 5.5. We illustrate Theorem 5.5 using a matrix from Example 5.4. 
Let $A = \begin{bmatrix} 1+g & \epsilon \\ \epsilon & 1 \end{bmatrix}$, where 
$0 < \epsilon < g$. Let $x = [1, 0]^T$ and $\beta = \rho(x, A) = 1 + g$. 
Then $r = Ax - \beta x = [0, \epsilon]^T$ and $\|r\|_2 = \epsilon$. The 
eigenvalues of $A$ are 
$\alpha_\pm = 1 + \frac{g}{2} \pm \frac{g}{2}\left(1 + (2\epsilon/g)^2\right)^{1/2}$ 
and the eigenvectors are given in Example 5.4 (where the matrix is called 
$A + E$ instead of $A$). 

Theorem 5.5 predicts that $\|Ax - \beta x\|_2 = \|r\|_2 = \epsilon$ is a bound 
on the distance from $\beta = 1 + g$ to the nearest eigenvalue $\alpha_+$ of 
$A$; this is also predicted by Weyl’s theorem (Theorem 5.1). We will see below 
that this bound is much looser than bound (5.7). 

When $\epsilon$ is much smaller than $g$, there will be one eigenvalue near 
$1 + g$ with its eigenvector near $x$ and another eigenvalue near 1 with its 
eigenvector near $[0, 1]^T$. This means 
$\mathrm{gap}' = |\alpha_- - \rho(x, A)| = \frac{g}{2}\left(1 + (1 + (2\epsilon/g)^2)^{1/2}\right)$, 
and so bound (5.6) implies that the angle $\theta$ between $x$ and the true 
eigenvector is bounded by 

\[
\sin\theta \le \frac{\|r\|_2}{\mathrm{gap}'}
 = \frac{2\epsilon/g}{1 + (1 + (2\epsilon/g)^2)^{1/2}}.
\]

Comparing with the explicit eigenvectors in Example 5.4, we see that the upper 
bound is actually equal to $\tan\theta$, which is nearly the same as 
$\sin\theta$ for tiny $\theta$. So bound (5.6) is quite accurate. 

Now consider bound (5.7) on the difference $|\beta - \alpha_+|$. It turns out 
that for this 2-by-2 example both $|\beta - \alpha_+|$ and its bound are 
exactly equal to 

\[
\frac{\|r\|_2^2}{\mathrm{gap}'}
 = \epsilon \cdot \frac{2\epsilon/g}{1 + (1 + (2\epsilon/g)^2)^{1/2}}.
\]

Let us evaluate these bounds in the special case where $g = 10^{-2}$ and 
$\epsilon = 10^{-5}$. Then the eigenvalues of $A$ are approximately 
$\alpha_+ = 1.01000001 = 1.01 + 10^{-8}$ and 
$\alpha_- = .99999999 = 1 - 10^{-8}$. The first bound is 
$|\beta - \alpha_+| \le \|r\|_2 = 10^{-5}$, which is $10^3$ times larger than 
the actual error $10^{-8}$. In contrast, bound (5.7) is 
$|\beta - \alpha_+| \le \|r\|_2^2/\mathrm{gap}' = (10^{-5})^2/(1.01 - \alpha_-) \approx 10^{-8}$, 
which is tight. The actual angle $\theta$ between $x$ and the true eigenvector 
for $\alpha_+$ is about $10^{-3}$, as is the bound 
$\|r\|_2/\mathrm{gap}' = 10^{-5}/(1.01 - \alpha_-) \approx 10^{-3}$. ◇ 
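The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below uses only the closed-form expressions for the 2-by-2 matrix; the variable names are hypothetical, and the error $\alpha_+ - \beta$ is evaluated in the algebraically equivalent form $(2\epsilon^2/g)/(1 + s)$, with $s = (1 + (2\epsilon/g)^2)^{1/2}$, to avoid subtracting two nearly equal numbers.

```python
import math

g, eps = 1e-2, 1e-5
s = math.sqrt(1.0 + (2.0 * eps / g) ** 2)

beta = 1.0 + g                        # Rayleigh quotient rho(x, A) for x = e_1
alpha_minus = 1.0 + 0.5 * g - 0.5 * g * s
r_norm = eps                          # ||A x - beta x||_2
gap_prime = beta - alpha_minus        # = (g/2) * (1 + s)

# alpha_plus - beta, written as (2 eps^2 / g) / (1 + s) to avoid cancellation.
error = (2.0 * eps ** 2 / g) / (1.0 + s)
# Angle between x = e_1 and the eigenvector [1 + s, 2*eps/g] for alpha_plus.
theta = math.atan2(2.0 * eps / g, 1.0 + s)

bound_weak = r_norm                   # first bound of Theorem 5.5
bound_56 = r_norm / gap_prime         # bound (5.6) on sin(theta)
bound_57 = r_norm ** 2 / gap_prime    # bound (5.7) on |alpha_plus - beta|

print(error, bound_weak, bound_57)
print(math.sin(theta), bound_56)
```

The first printed line contrasts the loose residual bound with bound (5.7); the second compares $\sin\theta$ with bound (5.6), which equals $\tan\theta$ for this matrix.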


Finally, we discuss what happens when one has a group of $k$ tightly 
clustered eigenvalues, and wants to compute their eigenvectors. By “tightly 
clustered” we mean that the gap between any eigenvalue in the cluster and some 
other eigenvalue in the cluster is small but that eigenvalues not in the 
cluster are well separated. For example, one could have $k = 20$ eigenvalues in 
the interval [.9999, 1.0001], but all other eigenvalues might be greater than 2. 
Then Theorems 5.4 and 5.5 indicate that we cannot hope to get the individual 
eigenvectors accurately. However, it is possible to compute the k-dimensional 
invariant subspace spanned by these vectors quite accurately. See [195] for 
details. 


5.2.1. Relative Perturbation Theory 


This section describes tighter bounds on eigenvalues and eigenvectors than in 
the last section. These bounds are needed to justify the high-accuracy algo- 
rithms for computing singular values and eigenvalues described in sections 5.4.2 
and 5.4.3. 

To contrast the bounds that we will present here to those in the previous 
section, let us consider the 1-by-1 case. Given a scalar $\alpha$, a perturbed 
scalar $\hat\alpha = \alpha + e$ and a bound $|e| \le \epsilon$, we can 
obviously bound the absolute error in $\hat\alpha$ by 
$|\hat\alpha - \alpha| \le \epsilon$. This was the approach taken in the last 
section. Consider instead the perturbed scalar $\hat\alpha = x^2\alpha$ and a 
bound $|x^2 - 1| \le \epsilon$. This lets us bound the relative error in 
$\hat\alpha$ by 

\[
\frac{|\hat\alpha - \alpha|}{|\alpha|} = |x^2 - 1| \le \epsilon.
\]

We generalize this simple idea to matrices as follows. In the last section we 
bounded the absolute difference in the eigenvalues $\alpha_i$ of $A$ and 
$\hat\alpha_i$ of $\hat{A} = A + E$ by $|\hat\alpha_i - \alpha_i| \le \|E\|_2$. 
Here we will bound the relative difference between the eigenvalues $\alpha_i$ 
of $A$ and $\hat\alpha_i$ of $\hat{A} = X^TAX$ in terms of 
$\epsilon \equiv \|X^TX - I\|_2$. 


THEOREM 5.6. “Relative” Weyl. Let $A$ have eigenvalues $\alpha_i$ and 
$\hat{A} = X^TAX$ have eigenvalues $\hat\alpha_i$. Let 
$\epsilon \equiv \|X^TX - I\|_2$. Then 
$|\hat\alpha_i - \alpha_i| \le |\alpha_i|\epsilon$. If $\alpha_i \ne 0$, then 
we can also write 

\[
\frac{|\hat\alpha_i - \alpha_i|}{|\alpha_i|} \le \epsilon. \qquad (5.8)
\]

Proof. Since the ith eigenvalue of $A - \alpha_i I$ is zero, Sylvester’s 
theorem of inertia tells us that the same is true of 

\[
X^T(A - \alpha_i I)X = (X^TAX - \alpha_i I) + \alpha_i(I - X^TX) \equiv H + F.
\]

Weyl’s theorem says that $|\lambda_i(H) - 0| \le \|F\|_2$, or 
$|\hat\alpha_i - \alpha_i| \le |\alpha_i| \cdot \|X^TX - I\|_2 = |\alpha_i|\epsilon$. 

Note that when $X$ is orthogonal, $\epsilon = \|X^TX - I\|_2 = 0$, so the 
theorem confirms that $X^TAX$ and $A$ have the same eigenvalues. If $X$ is 
“nearly” orthogonal, i.e., $\epsilon$ is small, the theorem says the 
eigenvalues are nearly the same, in the sense of relative error. 
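The difference between the relative bound and Weyl's absolute bound shows up already for a 2-by-2 matrix with eigenvalues of very different sizes. The following is a sketch only: the helper names `eig_sym2`, `matmul`, and `transpose` are hypothetical, and the matrices `A` and `X` are arbitrary illustrative choices (with `X` close to orthogonal).

```python
import math

def eig_sym2(M):
    """Eigenvalues (largest first) of a symmetric 2-by-2 matrix."""
    mean = 0.5 * (M[0][0] + M[1][1])
    radius = math.hypot(0.5 * (M[0][0] - M[1][1]), M[0][1])
    return [mean + radius, mean - radius]

def matmul(P, Q):
    """Product of two 2-by-2 matrices stored as nested lists."""
    return [[sum(P[i][l] * Q[l][j] for l in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(P):
    return [[P[j][i] for j in range(2)] for i in range(2)]

A = [[100.0, 1.0], [1.0, 0.02]]            # eigenvalues near 100 and .01
X = [[1.001, 0.002], [-0.001, 0.999]]      # nearly orthogonal (illustrative)

A_hat = matmul(transpose(X), matmul(A, X))
XtX = matmul(transpose(X), X)
XtX_minus_I = [[XtX[i][j] - (1.0 if i == j else 0.0) for j in range(2)]
               for i in range(2)]
eps = max(abs(v) for v in eig_sym2(XtX_minus_I))       # ||X^T X - I||_2
E = [[A_hat[i][j] - A[i][j] for j in range(2)] for i in range(2)]
norm_E = max(abs(v) for v in eig_sym2(E))              # ||A_hat - A||_2

alpha = eig_sym2(A)
alpha_hat = eig_sym2(A_hat)
for a, a_hat in zip(alpha, alpha_hat):
    # Columns: eigenvalue, actual shift, relative bound, Weyl's absolute bound.
    print(a, abs(a_hat - a), abs(a) * eps, norm_E)
```

For the small eigenvalue the relative bound $|\alpha_i|\epsilon$ is far smaller than $\|\hat{A} - A\|_2$, which is the quantity Weyl's theorem would use.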


COROLLARY 5.2. Let $G$ be an arbitrary matrix with singular values $\sigma_i$, 
and let $\hat{G} = Y^TGX$ have singular values $\hat\sigma_i$. Let 
$\epsilon \equiv \max(\|X^TX - I\|_2, \|Y^TY - I\|_2)$. Then 
$|\hat\sigma_i - \sigma_i| \le \epsilon\sigma_i$. If $\sigma_i \ne 0$, then we 
can write 

\[
\frac{|\hat\sigma_i - \sigma_i|}{\sigma_i} \le \epsilon. \qquad (5.9)
\]

We can similarly extend Theorem 5.4 to bound the difference between 
eigenvectors $q_i$ of $A$ and eigenvectors $\hat{q}_i$ of $\hat{A} = X^TAX$. To 
do so, we need to define the relative gap in the spectrum. 


DEFINITION 5.4. The relative gap between an eigenvalue $\alpha_i$ of $A$ and 
the rest of the spectrum is defined to be 
$\mathrm{rel\_gap}(i, A) = \min_{j \ne i}\frac{|\alpha_j - \alpha_i|}{|\alpha_i|}$. 


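THEOREM 5.7. Suppose that $A$ has eigenvalues $\alpha_i$ and corresponding unit 
eigenvectors $q_i$. Suppose $\hat{A} = X^TAX$ has eigenvalues $\hat\alpha_i$ 
and corresponding unit eigenvectors $\hat{q}_i$. Let $\theta$ be the acute 
angle between $q_i$ and $\hat{q}_i$. Let 
$\epsilon_1 = \|I - X^{-T}X^{-1}\|_2$ and $\epsilon_2 = \|X - I\|_2$. Then 
provided that $\epsilon_1 < 1$ and $\mathrm{rel\_gap}(i, X^TAX) > 0$, 

\[
\frac{1}{2}\sin 2\theta \le \frac{\epsilon_1}{1 - \epsilon_1}
 \cdot \frac{1}{\mathrm{rel\_gap}(i, X^TAX)} + \epsilon_2.
\]

Proof. Let $\eta = \hat\alpha_i - \alpha_i$, $H = A - \hat\alpha_i I$, and 
$F = \hat\alpha_i(I - X^{-T}X^{-1})$. Note that 

\[
H + F = A - \hat\alpha_i X^{-T}X^{-1} = X^{-T}(X^TAX - \hat\alpha_i I)X^{-1}.
\]

Thus $Hq_i = -\eta q_i$ and $(H + F)(X\hat{q}_i) = 0$ so that $X\hat{q}_i$ is 
an eigenvector of $H + F$ with eigenvalue 0. Let $\theta_1$ be the acute angle 
between $q_i$ and $X\hat{q}_i$. By Theorem 5.4, we can bound 

\[
\frac{1}{2}\sin 2\theta_1 \le \frac{\|F\|_2}{\mathrm{gap}(i, H + F)}. \qquad (5.10)
\]

We have $\|F\|_2 = |\hat\alpha_i|\epsilon_1$. Now $\mathrm{gap}(i, H + F)$ is 
the magnitude of the smallest nonzero eigenvalue of $H + F$. Since 
$X^T(H + F)X = X^TAX - \hat\alpha_i I$ has eigenvalues 
$\hat\alpha_j - \hat\alpha_i$, Theorem 5.6 tells us that the eigenvalues of 
$H + F$ lie in intervals from $(1 - \epsilon_1)(\hat\alpha_j - \hat\alpha_i)$ 
to $(1 + \epsilon_1)(\hat\alpha_j - \hat\alpha_i)$. Thus 
$\mathrm{gap}(i, H + F) \ge (1 - \epsilon_1)\,\mathrm{gap}(i, X^TAX)$, and so 
substituting into (5.10) yields 

\[
\frac{1}{2}\sin 2\theta_1
 \le \frac{\epsilon_1|\hat\alpha_i|}{(1 - \epsilon_1)\,\mathrm{gap}(i, X^TAX)}
 = \frac{\epsilon_1}{(1 - \epsilon_1)\,\mathrm{rel\_gap}(i, X^TAX)}. \qquad (5.11)
\]

Now let $\theta_2$ be the acute angle between $X\hat{q}_i$ and $\hat{q}_i$ so 
that $\theta \le \theta_1 + \theta_2$. Using trigonometry we can bound 
$\sin\theta_2 \le \|(X - I)\hat{q}_i\|_2 \le \|X - I\|_2 = \epsilon_2$, and 
so by the triangle inequality (see Question 5.11) 

\[
\begin{aligned}
\frac{1}{2}\sin 2\theta
 &\le \frac{1}{2}\sin 2\theta_1 + \frac{1}{2}\sin 2\theta_2 \\
 &\le \frac{1}{2}\sin 2\theta_1 + \sin\theta_2 \\
 &\le \frac{\epsilon_1}{(1 - \epsilon_1)\,\mathrm{rel\_gap}(i, X^TAX)} + \epsilon_2
\end{aligned}
\]

as desired. 

An analogous theorem can be proven for singular vectors [99]. 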


EXAMPLE 5.6. We again consider the mass-spring system of Example 5.1 and 
use it to show that bounds on eigenvalues provided by Weyl’s theorem 
(Theorem 5.1) can be much worse (looser) than the “relative” version of Weyl’s 
theorem (Theorem 5.6). We will also see that the eigenvector bound of 
Theorem 5.7 can be much better (tighter) than the bound of Theorem 5.4. 

Suppose that $M = \mathrm{diag}(1, 100, 10000)$ and 
$K_D = \mathrm{diag}(10000, 100, 1)$. Following Example 5.1, we define 
$K = BK_DB^T$ and $\hat{K} = M^{-1/2}KM^{-1/2}$, where 

\[
B = \begin{bmatrix} 1 & -1 & \\ & 1 & -1 \\ & & 1 \end{bmatrix}
\]

and so 

\[
\hat{K} = M^{-1/2}KM^{-1/2} = \begin{bmatrix}
10100 & -10 & \\
-10 & 1.01 & -.001 \\
 & -.001 & .0001
\end{bmatrix}.
\]

The eigenvalues of $\hat{K}$ are approximately 10100, 1.0001, and .000099 
(their product must equal $\det\hat{K} = 1$). 
Suppose we now perturb the masses ($m_{ii}$) and spring constants ($k_{D,ii}$) 
by at most 1% each. How much can the eigenvalues change? The largest matrix 
entry is $\hat{K}_{11}$, and changing $m_{11}$ to .99 and $k_{D,11}$ to 10100 
will change $\hat{K}_{11}$ to about 10305, a change of 205 in norm. Thus, 
Weyl’s theorem tells us each eigenvalue could change by as much as ±205, which 
would change the smaller two eigenvalues utterly. The eigenvector bound from 
Theorem 5.4 also indicates that the corresponding eigenvectors could change 
completely. 

Now let us apply Theorem 5.6 to $\hat{K}$, or actually Corollary 5.2 to 
$G = M^{-1/2}BK_D^{1/2}$, where $\hat{K} = GG^T$ as defined in Example 5.1. 
Changing each mass by at most 1% is equivalent to perturbing $G$ to $XG$, where 
$X$ is diagonal with diagonal entries between $1/\sqrt{.99} \approx 1.005$ and 
$1/\sqrt{1.01} \approx .995$. Then Corollary 5.2 tells us that the singular 
values of $G$ can change only by factors within the interval [.995, 1.005], so 
the eigenvalues of $\hat{K}$ can change only by 1% too. In other words, the 
smallest eigenvalue can change only in its second decimal place, just like the 
largest eigenvalue. Similarly, changing the spring constants by at most 1% is 
equivalent to changing $G$ to $GX$, and again the eigenvalues cannot change by 
more than 1%. If we perturb both $M$ and $K_D$ at the same time, the 
eigenvalues will move by about 2%. Since the eigenvalues differ so much in 
magnitude, their relative gaps are all quite large, and so their 
eigenvectors can rotate only by about 3% in angle too. 
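The claims of this example can be checked by brute force on one particular perturbation. The sketch below is an assumption-laden illustration, not the book's software: it builds the 3-by-3 matrix $M^{-1/2}KM^{-1/2}$ from the masses and spring constants, computes its eigenvalues with a textbook cyclic Jacobi iteration (a stand-in for any symmetric eigensolver), and applies a single 1% perturbation chosen arbitrarily. The function names `jacobi_eigenvalues` and `k_hat` are hypothetical.

```python
import math

def jacobi_eigenvalues(A, sweeps=30, tol=1e-30):
    """Eigenvalues (descending) of a small symmetric matrix by cyclic Jacobi."""
    n = len(A)
    A = [row[:] for row in A]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off <= tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation (c, s) chosen so that the (p, q) entry becomes zero.
                tau = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
                c = 1.0 / math.hypot(1.0, t)
                s = c * t
                for k in range(n):      # A <- A J  (columns p, q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
                for k in range(n):      # A <- J^T A (rows p, q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
                A[p][q] = A[q][p] = 0.0
    return sorted((A[i][i] for i in range(n)), reverse=True)

def k_hat(m, k):
    """Tridiagonal M^{-1/2} K M^{-1/2} for masses m and spring constants k."""
    n = len(m)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        K[i][i] = (k[i] + (k[i + 1] if i + 1 < n else 0.0)) / m[i]
        if i + 1 < n:
            K[i][i + 1] = K[i + 1][i] = -k[i + 1] / math.sqrt(m[i] * m[i + 1])
    return K

m = [1.0, 100.0, 10000.0]
k = [10000.0, 100.0, 1.0]
base = jacobi_eigenvalues(k_hat(m, k))
# One particular 1% perturbation (chosen for illustration only).
m_pert = [0.99 * m[0], 1.01 * m[1], 0.99 * m[2]]
k_pert = [1.01 * k[0], 0.99 * k[1], 1.01 * k[2]]
pert = jacobi_eigenvalues(k_hat(m_pert, k_pert))
rel_change = [abs(b - a) / a for a, b in zip(base, pert)]
print(base)
print(rel_change)
```

A single perturbation cannot prove the bound, but it shows the behavior the relative theory predicts: all three eigenvalues, including the tiny one, move by roughly the same small percentage.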


For a different approach to relative error analysis, more suitable for matrices 
arising from differential (“unbounded”) operators, see [159]. 


5.3. Algorithms for the Symmetric Eigenproblem 


We discuss a variety of algorithms for the symmetric eigenproblem. As men- 
tioned in the introduction, we will discuss only direct methods, leaving iterative 
methods for Chapter 7. 

In Chapter 4 on the nonsymmetric eigenproblem, the only algorithm that 
we discussed was QR iteration, which could find all the eigenvalues and op- 
tionally all the eigenvectors. We have many more algorithms available for the 
symmetric eigenproblem, which offer us more flexibility and efficiency. For 
example, the Bisection algorithm described below can be used to find only the 
eigenvalues in a user-specified interval [a,b] and can do so much faster than it 
could find all the eigenvalues. 

All the algorithms below, except Rayleigh quotient iteration and Jacobi’s
method, assume that the matrix has first been reduced to tridiagonal form,
using the variation of Algorithm 4.6 in section 4.4.7. This is an initial cost
of $\frac{4}{3}n^3$ flops, or $\frac{8}{3}n^3$ flops if eigenvectors are also
desired.

1. Tridiagonal QR iteration. This algorithm finds all the eigenvalues,
and optionally all the eigenvectors, of a symmetric tridiagonal matrix.
Implemented efficiently, it is currently the fastest practical method to
find all the eigenvalues of a symmetric tridiagonal matrix, taking $O(n^2)$
flops. Since reducing a dense matrix to tridiagonal form costs
$\frac{4}{3}n^3$ flops, $O(n^2)$ is negligible for large enough n. But for
finding all the eigenvectors as well, QR iteration takes a little over
$6n^3$ flops on average and is only the fastest algorithm for small
matrices, up to about n = 25. This is the algorithm underlying the Matlab
command eig^17 and the LAPACK routines ssyev (for dense matrices) and sstev
(for tridiagonal matrices).


2. Rayleigh quotient iteration. This algorithm underlies QR iteration, but 
we present it separately in order to more easily analyze its extremely 
rapid convergence and because it may be used as an algorithm by itself. 
In fact, it generally converges cubically (as does QR iteration), which 
means that the number of correct digits asymptotically triples at each 
step. 


3. Divide-and-conquer. This is currently the fastest method to find all the
eigenvalues and eigenvectors of symmetric tridiagonal matrices larger
than n = 25. (The implementation in LAPACK, sstevd, defaults to
QR iteration for smaller matrices.)

In the worst case, divide-and-conquer requires $O(n^3)$ flops, but in
practice the constant is quite small. Over a large set of random test
cases, it appears to take only $O(n^{2.3})$ flops on average, and as low as
$O(n^2)$ for some eigenvalue distributions.

In theory, divide-and-conquer could be implemented to run in
$O(n \cdot \log^p n)$ flops, where p is a small integer [129]. This
super-fast implementation uses the fast multipole method (FMM) [122],
originally invented for the completely different problem of computing the
mutual forces on n electrically charged particles. But the complexity of
this super-fast implementation means that QR iteration is currently the
algorithm of choice for finding all eigenvalues, and divide-and-conquer
without the FMM is the method of choice for finding all eigenvalues and all
eigenvectors.


^17 Matlab checks to see whether the argument of eig is symmetric or not and
uses the symmetric algorithm when appropriate.

4. Bisection and inverse iteration. Bisection may be used to find just a
subset of the eigenvalues of a symmetric tridiagonal matrix, say, those in
an interval [a,b] or $[\alpha_i, \alpha_{i-j}]$. It needs only O(nk) flops,
where k is the number of eigenvalues desired. Thus Bisection can be much
faster than QR iteration when $k \ll n$, since QR iteration requires
$O(n^2)$ flops. Inverse iteration (Algorithm 4.2) can then be used to find
the corresponding eigenvectors. In the best case, when the eigenvalues are
“well separated” (we explain this more fully later), inverse iteration also
costs only O(nk) flops. This is much less than either QR or
divide-and-conquer (without the FMM), even when all eigenvalues and
eigenvectors are desired (k = n). But in the worst case, when many
eigenvalues are clustered close together, inverse iteration takes
$O(nk^2)$ flops and does not even guarantee the accuracy of the computed
eigenvectors (although in practice it is almost always accurate). So
divide-and-conquer and QR are currently the algorithms of choice for
finding all (or most) eigenvalues and eigenvectors, especially when
eigenvalues may be clustered. Bisection and inverse iteration are available
as options in the LAPACK routine ssyevx.


There is current research on inverse iteration addressing the problem of
close eigenvalues, which may make it the fastest method to find all the
eigenvectors (besides, theoretically, divide-and-conquer with
the FMM) [103, 201, 199, 174, 171, 173, 267]. However, software
implementing this improved version of inverse iteration is not yet available.


5. Jacobi’s method. This method is historically the oldest method for the
eigenproblem, dating to 1846. It is usually much slower than any of
the above methods, taking $O(n^3)$ flops with a large constant. But the
method remains interesting, because it is sometimes much more accurate
than the above methods. This is because Jacobi’s method is sometimes
capable of attaining the relative accuracy described in section 5.2.1 and
so can sometimes compute tiny eigenvalues much more accurately than
the previous methods [81]. We discuss the high-accuracy property of
Jacobi’s method in section 5.4.3, where we show how to compute the
SVD.


Subsequent sections describe these algorithms in more detail. Section 5.3.6
presents comparative performance results.


5.3.1. Tridiagonal QR Iteration 


Recall that the QR algorithm for the nonsymmetric eigenproblem had two 
phases: 


1. Given A, use Algorithm 4.6 to find an orthogonal Q so that $QAQ^T = H$
is upper Hessenberg.

2. Apply QR iteration to H (as described in section 4.4.8) to get a sequence
$H = H_0, H_1, H_2, \ldots$ of upper Hessenberg matrices converging to real
Schur form.


Our first algorithm for the symmetric eigenproblem is completely analogous 
to this: 


1. Given $A = A^T$, use the variation of Algorithm 4.6 in section 4.4.7 to
find an orthogonal Q so that $QAQ^T = T$ is tridiagonal.

2. Apply QR iteration to T to get a sequence $T = T_0, T_1, T_2, \ldots$ of
tridiagonal matrices converging to diagonal form.


We can see that QR iteration keeps all the $T_i$ tridiagonal by noting that
since $QAQ^T$ is symmetric and upper Hessenberg, it must also be lower
Hessenberg, i.e., tridiagonal. This keeps each QR iteration very inexpensive.
An operation count reveals the following:

1. Reducing A to symmetric tridiagonal form T costs $\frac{4}{3}n^3 + O(n^2)$
flops, or $\frac{8}{3}n^3 + O(n^2)$ flops if eigenvectors are also desired.

2. One tridiagonal QR iteration with a single shift (“bulge chasing”) costs
6n flops.

3. Finding all eigenvalues of T takes only 2 QR steps per eigenvalue on
average, for a total of $6n^2$ flops.

4. Finding all eigenvalues and eigenvectors of T costs $6n^3 + O(n^2)$ flops.

5. The total cost to find just the eigenvalues of A is
$\frac{4}{3}n^3 + O(n^2)$ flops.

6. The total cost to find all the eigenvalues and eigenvectors of A is
$8\frac{2}{3}n^3 + O(n^2)$ flops.


We must still describe how the shifts are chosen to implement each QR
iteration. Denote the ith iterate by

$$T_i = \begin{bmatrix} a_1 & b_1 & & \\ b_1 & \ddots & \ddots & \\ & \ddots & \ddots & b_{n-1} \\ & & b_{n-1} & a_n \end{bmatrix}.$$

The simplest choice of shift would be $\sigma_i = a_n$; this is the single
shift QR iteration discussed in section 4.4.8. It turns out to be cubically
convergent for almost all matrices, as shown in the next section.
Unfortunately, examples exist where it does not converge [195, p. 76], so to
get global convergence a slightly more complicated shift strategy is needed:
We let the shift $\sigma_i$ be the eigenvalue of
$\begin{bmatrix} a_{n-1} & b_{n-1} \\ b_{n-1} & a_n \end{bmatrix}$ that is
closest to $a_n$. This is called Wilkinson’s shift.

THEOREM 5.8. Wilkinson. QR iteration with Wilkinson’s shift is globally, and
at least linearly, convergent. It is asymptotically cubically convergent for almost
all matrices.


A proof of this theorem can be found in [195]. In LAPACK this routine 
is available as ssyev. The inner loop of the algorithm can be organized more 
efficiently when eigenvalues only are desired (ssterf; see also [102, 198]) than 
when eigenvectors are also computed (ssteqr). 
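
Wilkinson's shift only needs the trailing 2-by-2 block of the current iterate. The following is a minimal Python sketch, not the LAPACK code; the function name `wilkinson_shift` is ours, and the rearranged formula (which avoids subtracting nearly equal numbers) is a standard way to evaluate the closer eigenvalue rather than something taken from the text.

```python
import math

def wilkinson_shift(a_prev, b, a_last):
    """Eigenvalue of [[a_prev, b], [b, a_last]] that is closest to a_last."""
    delta = 0.5 * (a_prev - a_last)
    if delta == 0.0 and b == 0.0:
        return a_last
    sign = 1.0 if delta >= 0.0 else -1.0
    # The two eigenvalues are a_last + delta +/- sqrt(delta^2 + b^2); the one
    # nearer to a_last is rewritten here so that no cancellation occurs.
    return a_last - sign * b * b / (abs(delta) + math.hypot(delta, b))

# Illustrative use: a 2-by-2 trailing block with entries chosen for the demo.
sigma = wilkinson_shift(0.48539, -3.1883, -0.91563)
```

When b is zero the function returns a_last itself, which is the expected behavior once the last offdiagonal entry has deflated.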


EXAMPLE 5.7. Here is an illustration of the convergence of tridiagonal QR
iteration, starting with the following tridiagonal matrix (diagonals only are
shown, in columns):

$$T_0 = \mathrm{tridiag}\begin{pmatrix} & .24929 & \\ 1.263 & & 1.263 \\ & .96880 & \\ -.82812 & & -.82812 \\ & .48539 & \\ -3.1883 & & -3.1883 \\ & -.91563 & \end{pmatrix}.$$

The following table shows the last offdiagonal entry of each $T_i$, the last
diagonal entry of each $T_i$, and the difference between the last diagonal
entry and its ultimate value (the eigenvalue $\alpha \approx -3.54627$). The
cubic convergence of the error to zero in the last column is evident.

    i    T_i(4,3)         T_i(4,4)    T_i(4,4) - alpha
    0    -3.1883          -.91563     2.6306
    1    -5.7 * 10^-2     -3.5457     5.4 * 10^-4
    2    -2.5 * 10^-7     -3.5463     1.2 * 10^-14
    3    -6.1 * 10^-23    -3.5463     0

At this point

$$T_3 = \mathrm{tridiag}\begin{pmatrix} & 1.9871 & \\ .77513 & & .77513 \\ & 1.7049 & \\ -1.7207 & & -1.7207 \\ & .64214 & \\ -6.1 \cdot 10^{-23} & & -6.1 \cdot 10^{-23} \\ & -3.5463 & \end{pmatrix},$$

and we set the very tiny (4,3) and (3,4) entries to 0. This is called deflation
and is stable, perturbing $T_3$ by only $6.1 \cdot 10^{-23}$ in norm. We now
apply QR iteration again to the leading 3-by-3 submatrix of $T_3$, repeating
the process to get the other eigenvalues.
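
The behavior in this example can be explored with a short Python sketch. It is only an illustration under stated assumptions: the matrix is stored densely and each step factors $T - \sigma I = QR$ with explicit Givens rotations, rather than using the 6n-flop bulge-chasing formulation, and the helper names (`qr_step`, `tridiag`, `wilkinson_shift`) are ours. The entries of $T_0$ are the rounded values printed in the example, so the iterates need not match the table digit for digit.

```python
import math

def wilkinson_shift(a_prev, b, a_last):
    """Eigenvalue of [[a_prev, b], [b, a_last]] closest to a_last."""
    delta = 0.5 * (a_prev - a_last)
    if delta == 0.0 and b == 0.0:
        return a_last
    sign = 1.0 if delta >= 0.0 else -1.0
    return a_last - sign * b * b / (abs(delta) + math.hypot(delta, b))

def qr_step(T, sigma):
    """One shifted QR step: factor T - sigma*I = Q*R, return R*Q + sigma*I."""
    n = len(T)
    R = [[T[i][j] - (sigma if i == j else 0.0) for j in range(n)] for i in range(n)]
    rotations = []
    for k in range(n - 1):                  # Givens rotations zero the subdiagonal
        r = math.hypot(R[k][k], R[k + 1][k])
        c, s = (1.0, 0.0) if r == 0.0 else (R[k][k] / r, R[k + 1][k] / r)
        rotations.append((c, s))
        for j in range(n):
            top, bot = R[k][j], R[k + 1][j]
            R[k][j] = c * top + s * bot
            R[k + 1][j] = -s * top + c * bot
        R[k + 1][k] = 0.0                   # zero by construction
    for k, (c, s) in enumerate(rotations):  # form R*Q one rotation at a time
        for i in range(n):
            left, right = R[i][k], R[i][k + 1]
            R[i][k] = c * left + s * right
            R[i][k + 1] = -s * left + c * right
    for i in range(n):
        R[i][i] += sigma
    return R

def tridiag(diag, off):
    """Dense symmetric tridiagonal matrix from its diagonal and offdiagonal."""
    n = len(diag)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        T[i][i] = diag[i]
    for i in range(n - 1):
        T[i][i + 1] = T[i + 1][i] = off[i]
    return T

T = tridiag([0.24929, 0.96880, 0.48539, -0.91563], [1.263, -0.82812, -3.1883])
history = []
for _ in range(5):
    sigma = wilkinson_shift(T[-2][-2], T[-1][-2], T[-1][-1])
    T = qr_step(T, sigma)
    history.append((T[-1][-2], T[-1][-1]))  # last offdiagonal, last diagonal
```

Printing `history` shows the last offdiagonal entry collapsing within a few steps while the last diagonal entry settles near the eigenvalue quoted in the example.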
5.3.2. Rayleigh Quotient Iteration 


Recall from our analysis of QR iteration in section 4.4 that we are implicitly 
doing inverse iteration at every step. We explore this more carefully when the 
shift we choose to use in the inverse iteration is the Rayleigh quotient. 


ALGORITHM 5.1. Rayleigh quotient iteration: Given $x_0$ with $\|x_0\|_2 = 1$,
and a user-supplied stopping tolerance tol, we iterate

    $\rho_0 = \rho(x_0, A) = \frac{x_0^T A x_0}{x_0^T x_0}$
    i = 0
    repeat
        i = i + 1
        $y_i = (A - \rho_{i-1} I)^{-1} x_{i-1}$
        $x_i = y_i / \|y_i\|_2$
        $\rho_i = \rho(x_i, A)$
    until convergence ($\|Ax_i - \rho_i x_i\|_2 <$ tol)


When the stopping criterion is satisfied, Theorem 5.5 tells us that $\rho_i$ is
within tol of an eigenvalue of A. 
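
Rayleigh quotient iteration can be sketched in a few lines of Python. This is an illustration only: the dense solver `solve` (Gaussian elimination with partial pivoting), the zero-pivot nudge, and the iteration cap `max_iter` are our assumptions, not part of Algorithm 5.1, and a practical code for tridiagonal matrices would use a tridiagonal solver instead.

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    A = [list(M[i]) + [rhs[i]] for i in range(n)]
    scale = max(abs(v) for row in M for v in row) or 1.0
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        if A[k][k] == 0.0:
            A[k][k] = 2.0 ** -52 * scale    # nudge an exactly zero pivot
        for r in range(k + 1, n):
            factor = A[r][k] / A[k][k]
            for c in range(k, n + 1):
                A[r][c] -= factor * A[k][c]
    y = [0.0] * n
    for k in range(n - 1, -1, -1):
        tail = sum(A[k][c] * y[c] for c in range(k + 1, n))
        y[k] = (A[k][n] - tail) / A[k][k]
    return y

def rayleigh_quotient_iteration(A, x0, tol=1e-12, max_iter=50):
    """Return (rho, x) with ||A x - rho x||_2 < tol if the iteration converged."""
    n = len(A)
    nrm = math.sqrt(sum(v * v for v in x0))
    x = [v / nrm for v in x0]
    rho = sum(xi * yi for xi, yi in zip(x, matvec(A, x)))
    for _ in range(max_iter):
        Ax = matvec(A, x)
        if math.sqrt(sum((Ax[i] - rho * x[i]) ** 2 for i in range(n))) < tol:
            break
        shifted = [[A[i][j] - (rho if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        y = solve(shifted, x)               # inverse iteration with shift rho
        nrm = math.sqrt(sum(v * v for v in y))
        x = [v / nrm for v in y]
        rho = sum(xi * yi for xi, yi in zip(x, matvec(A, x)))
    return rho, x

A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
rho, x = rayleigh_quotient_iteration(A, [0.0, 0.0, 1.0])
```

The residual is tested before each solve, so the nearly singular system $(A - \rho I)y = x$ is only solved while $\rho$ is still measurably away from an eigenvalue.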

If one uses the shift $\sigma_i = a_{nn}$ in QR iteration and starts Rayleigh
quotient iteration with $x_0 = [0, \ldots, 0, 1]^T$, then the connection
between QR and inverse iteration discussed in section 4.4 can be used to show
that the sequence of $\sigma_i$ and $\rho_i$ from the two algorithms are
identical (see Question 5.13). In this case we will prove that convergence is
almost always cubic.


THEOREM 5.9. Rayleigh quotient iteration is locally cubically convergent; i.e., 
the number of correct digits triples at each step once the error is small enough 
and the eigenvalue is simple. 


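Proof. We claim that it is enough to analyze the case when A is diagonal. To
see why, write $Q^T A Q = \Lambda$, where Q is the orthogonal matrix whose
columns are eigenvectors, and $\Lambda = \mathrm{diag}(\alpha_1, \ldots, \alpha_n)$
is the diagonal matrix of eigenvalues. Now change variables in Rayleigh
quotient iteration to $\hat x_i \equiv Q^T x_i$ and $\hat y_i \equiv Q^T y_i$.
Then

$$\rho_i = \rho(x_i, A) = \frac{x_i^T A x_i}{x_i^T x_i} = \frac{\hat x_i^T Q^T A Q \hat x_i}{\hat x_i^T Q^T Q \hat x_i} = \frac{\hat x_i^T \Lambda \hat x_i}{\hat x_i^T \hat x_i} = \rho(\hat x_i, \Lambda)$$

and $Q \hat y_{i+1} = (A - \rho_i I)^{-1} Q \hat x_i$, so

$$\hat y_{i+1} = Q^T (A - \rho_i I)^{-1} Q \hat x_i = (Q^T A Q - \rho_i I)^{-1} \hat x_i = (\Lambda - \rho_i I)^{-1} \hat x_i.$$

Therefore, running Rayleigh quotient iteration with A and $x_0$ is equivalent
to running Rayleigh quotient iteration with $\Lambda$ and $\hat x_0$. Thus we
will assume without loss of generality that $A = \Lambda$ is already diagonal,
so the eigenvectors of A are $e_i$, the columns of the identity matrix.

Suppose without loss of generality that $x_i$ is converging to $e_1$, so we can
write $x_i = e_1 + d_i$, where $\|d_i\|_2 \equiv \epsilon \ll 1$. To prove cubic
convergence, we need to show that $x_{i+1} = e_1 + d_{i+1}$ with
$\|d_{i+1}\|_2 = O(\epsilon^3)$.

We first note that

$$1 = x_i^T x_i = (e_1 + d_i)^T (e_1 + d_i) = e_1^T e_1 + 2 e_1^T d_i + d_i^T d_i = 1 + 2 d_{i1} + \epsilon^2$$

so that $d_{i1} = -\epsilon^2/2$. Therefore

$$\rho_i = x_i^T \Lambda x_i = (e_1 + d_i)^T \Lambda (e_1 + d_i) = e_1^T \Lambda e_1 + 2 e_1^T \Lambda d_i + d_i^T \Lambda d_i = \alpha_1 - \eta,$$

where $\eta \equiv -2 e_1^T \Lambda d_i - d_i^T \Lambda d_i = \alpha_1 \epsilon^2 - d_i^T \Lambda d_i$.
We see that

$$|\eta| \le |\alpha_1| \epsilon^2 + \|\Lambda\|_2 \|d_i\|_2^2 \le 2 \|\Lambda\|_2 \epsilon^2, \qquad (5.12)$$

so $\rho_i = \alpha_1 - \eta = \alpha_1 + O(\epsilon^2)$ is a very good
approximation to the eigenvalue $\alpha_1$.

Now we can write

$$\begin{aligned}
y_{i+1} &= (\Lambda - \rho_i I)^{-1} x_i \\
&= \left[ \frac{x_{i1}}{\alpha_1 - \rho_i}, \frac{x_{i2}}{\alpha_2 - \rho_i}, \ldots, \frac{x_{in}}{\alpha_n - \rho_i} \right]^T
 \quad \text{because } (\Lambda - \rho_i I)^{-1} = \mathrm{diag}\left( \frac{1}{\alpha_j - \rho_i} \right) \\
&= \left[ \frac{1 + d_{i1}}{\alpha_1 - \rho_i}, \frac{d_{i2}}{\alpha_2 - \rho_i}, \ldots, \frac{d_{in}}{\alpha_n - \rho_i} \right]^T
 \quad \text{because } x_i = e_1 + d_i \\
&= \left[ \frac{1 - \epsilon^2/2}{\eta}, \frac{d_{i2}}{\alpha_2 - \alpha_1 + \eta}, \ldots, \frac{d_{in}}{\alpha_n - \alpha_1 + \eta} \right]^T
 \quad \text{because } \rho_i = \alpha_1 - \eta \text{ and } d_{i1} = -\epsilon^2/2 \\
&= \frac{1 - \epsilon^2/2}{\eta} \cdot \left[ 1, \frac{d_{i2}\, \eta}{(1 - \epsilon^2/2)(\alpha_2 - \alpha_1 + \eta)}, \ldots, \frac{d_{in}\, \eta}{(1 - \epsilon^2/2)(\alpha_n - \alpha_1 + \eta)} \right]^T \\
&\equiv \frac{1 - \epsilon^2/2}{\eta} \cdot (e_1 + \hat d_{i+1}).
\end{aligned}$$

To bound $\|\hat d_{i+1}\|_2$, we note that we can bound each denominator using
$|\alpha_j - \alpha_1 + \eta| \ge \mathrm{gap}(1, \Lambda) - |\eta|$, so using
(5.12) as well we get

$$\|\hat d_{i+1}\|_2 \le \frac{\|d_i\|_2 |\eta|}{(1 - \epsilon^2/2)(\mathrm{gap}(1, \Lambda) - |\eta|)} \le \frac{2 \|\Lambda\|_2 \epsilon^3}{(1 - \epsilon^2/2)(\mathrm{gap}(1, \Lambda) - 2 \|\Lambda\|_2 \epsilon^2)}$$

or $\|\hat d_{i+1}\|_2 = O(\epsilon^3)$. Finally, since
$x_{i+1} = e_1 + d_{i+1} = (e_1 + \hat d_{i+1}) / \|e_1 + \hat d_{i+1}\|_2$,
we see $\|d_{i+1}\|_2 = O(\epsilon^3)$ as well.


5.3.3. Divide-and-Conquer 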


This method is the fastest now available if you want all eigenvalues and eigen- 
vectors of a tridiagonal matrix whose dimension is larger than about 25. (The 
exact threshold depends on the computer.) It is quite subtle to implement in a 
numerically stable way. Indeed, although this method was first introduced in
1981 [58], the “right” implementation was not discovered until 1992 [125, 129].
It is available as the LAPACK routines ssyevd for dense matrices and
sstevd for tridiagonal matrices. These routines use divide-and-conquer for
matrices of dimension larger than 25 and automatically switch to QR iteration
for smaller matrices (or if eigenvalues only are desired).


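We first discuss the overall structure of the algorithm, and leave numerical
details for later. Let T be the symmetric tridiagonal matrix with diagonal
entries $a_1, \ldots, a_n$ and offdiagonal entries $b_1, \ldots, b_{n-1}$,
split after row m as

$$T = \begin{bmatrix} T_1 & 0 \\ 0 & T_2 \end{bmatrix} + b_m v v^T, \qquad v = [0, \ldots, 0, 1, 1, 0, \ldots, 0]^T,$$

where the two 1s of v are in positions m and m+1, $T_1$ is the leading m-by-m
block of T with its last diagonal entry $a_m$ replaced by $a_m - b_m$, and
$T_2$ is the trailing block of T with its first diagonal entry $a_{m+1}$
replaced by $a_{m+1} - b_m$.

Suppose that we have the eigendecompositions of $T_1$ and $T_2$:
$T_i = Q_i \Lambda_i Q_i^T$. These will be computed recursively by this same
algorithm. We relate the eigenvalues of T to those of $T_1$ and $T_2$ as
follows.

$$\begin{aligned}
T &= \begin{bmatrix} T_1 & 0 \\ 0 & T_2 \end{bmatrix} + b_m v v^T \\
&= \begin{bmatrix} Q_1 \Lambda_1 Q_1^T & 0 \\ 0 & Q_2 \Lambda_2 Q_2^T \end{bmatrix} + b_m v v^T \\
&= \begin{bmatrix} Q_1 & 0 \\ 0 & Q_2 \end{bmatrix} \left( \begin{bmatrix} \Lambda_1 & \\ & \Lambda_2 \end{bmatrix} + b_m u u^T \right) \begin{bmatrix} Q_1^T & 0 \\ 0 & Q_2^T \end{bmatrix},
\end{aligned}$$

where

$$u = \begin{bmatrix} Q_1^T & 0 \\ 0 & Q_2^T \end{bmatrix} v = \begin{bmatrix} \text{last column of } Q_1^T \\ \text{first column of } Q_2^T \end{bmatrix}$$

since $v = [0, \ldots, 0, 1, 1, 0, \ldots, 0]^T$. Therefore, the eigenvalues of
T are the same as those of the similar matrix $D + \rho u u^T$ where
$D = \begin{bmatrix} \Lambda_1 & \\ & \Lambda_2 \end{bmatrix}$ is diagonal,
$\rho = b_m$ is a scalar, and u is a vector. Henceforth we will assume without
loss of generality that the diagonal $d_1, \ldots, d_n$ of D is sorted:
$d_n \le \cdots \le d_1$.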

To find the eigenvalues of $D + \rho u u^T$, assume first that $D - \lambda I$
is nonsingular, and compute the characteristic polynomial as follows:

$$\det(D + \rho u u^T - \lambda I) = \det((D - \lambda I)(I + \rho (D - \lambda I)^{-1} u u^T)). \qquad (5.13)$$

Since $D - \lambda I$ is nonsingular,
$\det(I + \rho (D - \lambda I)^{-1} u u^T) = 0$ whenever $\lambda$ is an
eigenvalue. Note that $I + \rho (D - \lambda I)^{-1} u u^T$ is the identity
plus a rank-1 matrix; the determinant of such a matrix is easy to compute:


LEMMA 5.1. If x and y are vectors, $\det(I + x y^T) = 1 + y^T x$.


The proof is left to Question 5.14. 
Therefore

$$\det(I + \rho (D - \lambda I)^{-1} u u^T) = 1 + \rho u^T (D - \lambda I)^{-1} u = 1 + \rho \sum_{i=1}^n \frac{u_i^2}{d_i - \lambda} \equiv f(\lambda), \qquad (5.14)$$

and the eigenvalues of T are the roots of the so-called secular equation
$f(\lambda) = 0$. If all $d_i$ are distinct and all $u_i \neq 0$ (the generic
case), the function $f(\lambda)$ has the graph shown in Figure 5.2 (for n = 4
and $\rho > 0$).

Fig. 5.2. Graph of $f(\lambda) = 1 + \frac{.5}{1 - \lambda} + \frac{.5}{2 - \lambda} + \frac{.5}{3 - \lambda} + \frac{.5}{4 - \lambda}$.

As we can see, the line y = 1 is a horizontal asymptote, and the lines
$\lambda = d_i$ are vertical asymptotes. Since
$f'(\lambda) = \rho \sum_{i=1}^n \frac{u_i^2}{(d_i - \lambda)^2} > 0$, the
function is strictly increasing except at $\lambda = d_i$. Thus the roots of
$f(\lambda)$ are interlaced by the $d_i$, and there is one more root to the
right of $d_1$ ($d_1 = 4$ in Figure 5.2). (If $\rho < 0$, then $f(\lambda)$ is
decreasing and there is one more root to the left of $d_n$.) Since
$f(\lambda)$ is monotonic and smooth on the intervals $(d_i, d_{i+1})$, it is
possible to find a version of Newton’s method that converges fast and
monotonically to each root, given a starting point in $(d_i, d_{i+1})$. We
discuss details later in this section. All we need to know here is that in
practice Newton converges in a bounded number of steps per eigenvalue. Since
evaluating $f(\lambda)$ and $f'(\lambda)$ costs O(n) flops, finding one
eigenvalue costs O(n) flops, and so finding all n eigenvalues of
$D + \rho u u^T$ costs $O(n^2)$ flops.
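
The interlacing picture can be checked numerically. The sketch below is only an illustration: it uses plain bisection on each interval, which is a slower stand-in for the Newton-like iteration the text refers to, and it assumes $\rho > 0$, distinct $d_i$ given in ascending order, and nonzero $u_i$. The names `secular_f` and `secular_roots_bisect` are ours.

```python
def secular_f(lam, d, u2, rho=1.0):
    """f(lam) = 1 + rho * sum_i u_i^2 / (d_i - lam); u2 holds the squares u_i^2."""
    return 1.0 + rho * sum(w / (dk - lam) for dk, w in zip(d, u2))

def secular_roots_bisect(d, u2, rho=1.0, iters=200):
    """Roots of f by bisection: one per interval (d[i], d[i+1]) plus one to the
    right of max(d). Assumes rho > 0, d ascending and distinct, all u2 > 0."""
    n = len(d)
    roots = []
    for i in range(n):
        lo = d[i]
        # The largest root is at most max(d) + rho * ||u||^2.
        hi = d[i + 1] if i + 1 < n else d[-1] + rho * sum(u2)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:      # interval cannot shrink further
                break
            if secular_f(mid, d, u2, rho) > 0.0:
                hi = mid                    # f is increasing, so root is left
            else:
                lo = mid
        roots.append(0.5 * (lo + hi))
    return roots

# The function graphed in Figure 5.2: d = 1, 2, 3, 4 and every u_i^2 = .5.
roots = secular_roots_bisect([1.0, 2.0, 3.0, 4.0], [0.5] * 4)
```

A useful sanity check is that the roots sum to the trace of $D + \rho uu^T$, here $10 + 4 \cdot .5 = 12$.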

It is also easy to derive an expression for the eigenvectors of
$D + \rho u u^T$.

LEMMA 5.2. If $\alpha$ is an eigenvalue of $D + \rho u u^T$, then
$(D - \alpha I)^{-1} u$ is its eigenvector. Since $D - \alpha I$ is diagonal,
this costs O(n) flops to compute.

Proof.

$$\begin{aligned}
(D + \rho u u^T)[(D - \alpha I)^{-1} u] &= (D - \alpha I + \alpha I + \rho u u^T)(D - \alpha I)^{-1} u \\
&= u + \alpha (D - \alpha I)^{-1} u + u [\rho u^T (D - \alpha I)^{-1} u] \\
&= u + \alpha (D - \alpha I)^{-1} u - u \quad \text{since } \rho u^T (D - \alpha I)^{-1} u + 1 = f(\alpha) = 0 \\
&= \alpha [(D - \alpha I)^{-1} u] \quad \text{as desired.}
\end{aligned}$$

Evaluating this formula for all n eigenvectors costs $O(n^2)$ flops.
Unfortunately, this simple formula for the eigenvectors is not numerically
stable, because two very close values of $\alpha_i$ can result in
nonorthogonal computed eigenvectors $u_i$. Finding a stable alternative took
over a decade from the original formulation of this algorithm. We discuss
details later in this section.
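
A tiny worked instance of Lemma 5.2, as a hedged Python sketch: the 2-by-2 data below are chosen by us so that the eigenvalues of $D + \rho uu^T = \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix}$ are known in closed form ($3 \pm \sqrt{2}$), and the helper name `eigvec_from_root` is hypothetical.

```python
import math

def eigvec_from_root(d, u, alpha):
    """Normalized (D - alpha*I)^{-1} u, the eigenvector formula of Lemma 5.2."""
    w = [ui / (di - alpha) for di, ui in zip(d, u)]
    nrm = math.sqrt(sum(x * x for x in w))
    return [x / nrm for x in w]

d = [1.0, 3.0]
u = [1.0, 1.0]
rho = 1.0
alpha = 3.0 + math.sqrt(2.0)          # an eigenvalue of [[2, 1], [1, 4]]

v = eigvec_from_root(d, u, alpha)
# Residual of (D + rho*u*u^T) v - alpha*v, formed without building the matrix.
utv = sum(ui * vi for ui, vi in zip(u, v))
residual = [d[i] * v[i] + rho * u[i] * utv - alpha * v[i] for i in range(2)]
```

Each eigenvector costs one division per entry, which is the O(n) count stated in the lemma.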

The overall algorithm is recursive. 


ALGORITHM 5.2. Finding eigenvalues and eigenvectors of a symmetric tridiagonal
matrix using divide-and-conquer:

    proc dc_eig (T, Q, $\Lambda$) ..... from input T compute
                                  outputs Q and $\Lambda$ where $T = Q \Lambda Q^T$
        if T is 1-by-1
            return Q = 1, $\Lambda$ = T
        else
            form $T = \begin{bmatrix} T_1 & 0 \\ 0 & T_2 \end{bmatrix} + b_m v v^T$
            call dc_eig ($T_1$, $Q_1$, $\Lambda_1$)
            call dc_eig ($T_2$, $Q_2$, $\Lambda_2$)
            form $D + \rho u u^T$ from $\Lambda_1$, $\Lambda_2$, $Q_1$, $Q_2$
            find eigenvalues $\Lambda$ and eigenvectors $Q'$ of $D + \rho u u^T$
            form $Q = \begin{bmatrix} Q_1 & 0 \\ 0 & Q_2 \end{bmatrix} \cdot Q'$ = eigenvectors of T
            return Q and $\Lambda$
        endif
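
The recursion can be prototyped directly. The Python sketch below is a teaching illustration under several assumptions that are ours, not the book's: it has no deflation (so it requires nonzero offdiagonal entries, distinct $d_i$, and nonzero $u_i$ at every level), it solves the secular equation by bisection instead of a Newton-like iteration, it uses the simple eigenvector formula of Lemma 5.2 rather than a stabilized one, and it takes $\rho = |b_m|$ with $v = e_m + \mathrm{sign}(b_m) e_{m+1}$ so that $\rho > 0$ always holds.

```python
import math

def secular_roots(d, u, rho, iters=200):
    """Roots of 1 + rho*sum(u_i^2/(d_i - lam)) by bisection.
    Assumes rho > 0, d ascending and distinct, all u_i nonzero."""
    total = rho * sum(ui * ui for ui in u)
    roots = []
    for i in range(len(d)):
        lo = d[i]
        hi = d[i + 1] if i + 1 < len(d) else d[i] + total
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:
                break
            f = 1.0 + rho * sum(uk * uk / (dk - mid) for dk, uk in zip(d, u))
            if f > 0.0:
                hi = mid
            else:
                lo = mid
        roots.append(0.5 * (lo + hi))
    return roots

def dc_eig(a, b):
    """Return (lam, Q) with T = Q diag(lam) Q^T for the symmetric tridiagonal T
    with diagonal a and offdiagonal b; Q[i][j] is entry i of eigenvector j."""
    n = len(a)
    if n == 1:
        return [a[0]], [[1.0]]
    m = n // 2                                   # T1 is m-by-m
    rho = abs(b[m - 1])
    sgn = 1.0 if b[m - 1] >= 0.0 else -1.0
    a1, a2 = list(a[:m]), list(a[m:])
    a1[-1] -= rho                                # remove rho*v*v^T from the
    a2[0] -= rho                                 # two touching diagonal entries
    lam1, Q1 = dc_eig(a1, list(b[:m - 1]))
    lam2, Q2 = dc_eig(a2, list(b[m:]))
    d = lam1 + lam2
    u = list(Q1[-1]) + [sgn * q for q in Q2[0]]  # u = blockdiag(Q1, Q2)^T v
    order = sorted(range(n), key=lambda k: d[k])
    lam = secular_roots([d[k] for k in order], [u[k] for k in order], rho)
    Q = [[0.0] * n for _ in range(n)]
    for j, alpha in enumerate(lam):
        w = [u[k] / (d[k] - alpha) for k in range(n)]      # Lemma 5.2
        nrm = math.sqrt(sum(x * x for x in w))
        w = [x / nrm for x in w]
        for i in range(m):                       # Q = blockdiag(Q1, Q2) * Q'
            Q[i][j] = sum(Q1[i][k] * w[k] for k in range(m))
        for i in range(n - m):
            Q[m + i][j] = sum(Q2[i][k] * w[m + k] for k in range(n - m))
    return lam, Q

a = [0.24929, 0.96880, 0.48539, -0.91563]
b = [1.263, -0.82812, -3.1883]
lam, Q = dc_eig(a, b)
```

The final double loop is the block matrix multiplication whose cost dominates the algorithm; a production code would skip the columns removed by deflation.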


We analyze the complexity of Algorithm 5.2 as follows. Let t(n) be the
number of flops to run dc_eig on an n-by-n matrix. Then

    t(n) = 2t(n/2)     for the 2 recursive calls to dc_eig($T_i$, $Q_i$, $\Lambda_i$)
           + $O(n^2)$  to find the eigenvalues of $D + \rho u u^T$
           + $O(n^2)$  to find the eigenvectors of $D + \rho u u^T$
           + $c \cdot n^3$  to multiply $Q = \begin{bmatrix} Q_1 & 0 \\ 0 & Q_2 \end{bmatrix} \cdot Q'$.

If we treat $Q_1$, $Q_2$, and $Q'$ as dense matrices and use the standard
matrix multiplication algorithm, the constant in the last line is c = 1. Thus
we see that the major cost in the algorithm is the matrix multiplication in the
last line. Ignoring the $O(n^2)$ terms, we get $t(n) = 2t(n/2) + cn^3$. This
geometric sum can be evaluated, yielding $t(n) \approx c \frac{4}{3} n^3$ (see
Question 5.15). In practice, c is usually much less than 1, because a
phenomenon called deflation makes $Q'$ quite sparse.
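
As a brief sketch of where the factor $\frac{4}{3}$ comes from (assuming n is a power of 2 and ignoring the lower-order terms), the recurrence can be unrolled level by level:

```latex
t(n) = cn^3 + 2c\left(\frac{n}{2}\right)^3 + 4c\left(\frac{n}{4}\right)^3 + \cdots
     = cn^3 \sum_{k \ge 0} \frac{2^k}{8^k}
     = cn^3 \sum_{k \ge 0} 4^{-k}
     \approx \frac{4}{3}\, c\, n^3 .
```

Level k of the recursion has $2^k$ subproblems of size $n/2^k$, so the top-level multiplication accounts for three quarters of the total work.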

After discussing deflation in the next section, we discuss details of solv- 
ing the secular equation, and computing the eigenvectors stably. Finally, we 
discuss how to accelerate the method by exploiting FMM techniques used in 
electrostatic particle simulation [122]. These sections may be skipped on a first 
reading. 


Deflation 


So far in our presentation we have assumed that the $d_i$ are distinct, and the
$u_i$ nonzero. When this is not the case, the secular equation
$f(\lambda) = 0$ will have k < n vertical asymptotes, and so k < n roots. But
it turns out that the remaining n - k eigenvalues are available very cheaply:
If $d_i = d_{i+1}$, or if $u_i = 0$, one can easily show that $d_i$ is also an
eigenvalue of $D + \rho u u^T$ (see Question 5.16). This process is called
deflation. In practice we use a threshold and deflate $d_i$ either if it is
close enough to $d_{i+1}$ or if $u_i$ is small enough.

In practice, deflation happens quite frequently: In experiments with random
dense matrices with uniformly distributed eigenvalues, over 15% of the
eigenvalues of the largest $D + \rho u u^T$ deflated, and in experiments with
random dense matrices with eigenvalues approaching 0 geometrically, over 85%
deflated! It is essential to take advantage of this behavior to make the
algorithm fast [58, 208].

The payoff in deflation is not in making the solution of the secular equation
faster; this costs only $O(n^2)$ anyway. The payoff is in making the matrix
multiplication in the last step of the algorithm fast. For if $u_i = 0$, then
the corresponding eigenvector is $e_i$, the ith column of the identity matrix
(see Question 5.16). This means that the ith column of $Q'$ is $e_i$, so no
work is needed to compute the ith column of Q in the two multiplications by
$Q_1$ and $Q_2$. There is a similar simplification when $d_i = d_{i+1}$. When
many eigenvalues deflate, much of the work in the matrix multiplication can be
eliminated. This is borne out in the numerical experiments presented in
section 5.3.6.
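
The $u_i = 0$ case is easy to confirm numerically. The sketch below is illustrative only: the data are made up, the threshold `tol` in `deflatable` is a hypothetical parameter (the text does not specify the threshold used in practice), and the $d_i = d_{i+1}$ case, which needs a rotation to zero out one entry of u, is not handled.

```python
def rank_one_update_times(d, u, rho, x):
    """Compute (D + rho*u*u^T) x with D = diag(d), without forming the matrix."""
    utx = sum(ui * xi for ui, xi in zip(u, x))
    return [di * xi + rho * ui * utx for di, ui, xi in zip(d, u, x)]

def deflatable(d, u, tol):
    """Indices i whose u_i is negligible, so that d_i is (nearly) an eigenvalue."""
    return [i for i, ui in enumerate(u) if abs(ui) <= tol]

d = [1.0, 2.0, 3.0]
u = [0.6, 0.0, 0.8]          # u_2 = 0, so d_2 = 2 deflates
e2 = [0.0, 1.0, 0.0]
y = rank_one_update_times(d, u, 0.5, e2)   # equals 2 * e2
```

Because the deflated eigenvector is a column of the identity, the corresponding column of Q is just a column of the block diagonal matrix of $Q_1$ and $Q_2$, with no arithmetic required.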


Solving the Secular Equation 


When some $u_i$ is small but too large to deflate, a problem arises when trying
to use Newton’s method to solve the secular equation. Recall that the principle
of Newton’s method for updating an approximate solution $\lambda_j$ of
$f(\lambda) = 0$ is

1. to approximate the function $f(\lambda)$ near $\lambda = \lambda_j$ with a
linear function $l(\lambda)$, whose graph is a straight line tangent to the
graph of $f(\lambda)$ at $\lambda = \lambda_j$,

2. to let $\lambda_{j+1}$ be the zero of this linear approximation:
$l(\lambda_{j+1}) = 0$.


The graph in Figure 5.2 offers no apparent difficulties to Newton’s method,
because the function $f(\lambda)$ appears to be reasonably well approximated by
straight lines near each zero. But now consider the graph in Figure 5.3, which
differs from Figure 5.2 only by changing $u_i^2$ from .5 to .001, which is not
nearly small enough to deflate. The graph of $f(\lambda)$ in the left-hand
figure is visually indistinguishable from its vertical and horizontal
asymptotes, so in the right-hand figure we blow it up around one of the
vertical asymptotes, $\lambda = 2$. We see that the graph of $f(\lambda)$
“turns the corner” very rapidly and is nearly horizontal for most values of
$\lambda$. Thus, if we started Newton’s method from almost any $\lambda_0$, the
linear approximation $l(\lambda)$ would also be nearly horizontal with a
slightly positive slope, so $\lambda_1$ would be an enormous negative number, a
useless approximation to the true zero.

Fig. 5.3. Graph of $f(\lambda) = 1 + \frac{10^{-3}}{1 - \lambda} + \frac{10^{-3}}{2 - \lambda} + \frac{10^{-3}}{3 - \lambda} + \frac{10^{-3}}{4 - \lambda}$.
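
A single Newton step on the function of Figure 5.3 makes the failure concrete. This is a minimal Python sketch; the starting point $\lambda_0 = 2.5$ is our choice for illustration.

```python
def f(lam, d, u2, rho=1.0):
    """Secular function 1 + rho * sum_i u_i^2 / (d_i - lam)."""
    return 1.0 + rho * sum(w / (dk - lam) for dk, w in zip(d, u2))

def fprime(lam, d, u2, rho=1.0):
    """Derivative rho * sum_i u_i^2 / (d_i - lam)^2."""
    return rho * sum(w / (dk - lam) ** 2 for dk, w in zip(d, u2))

d = [1.0, 2.0, 3.0, 4.0]
u2 = [0.001] * 4              # the u_i^2 of Figure 5.3

lam0 = 2.5                    # midpoint of the interval (2, 3)
lam1 = lam0 - f(lam0, d, u2) / fprime(lam0, d, u2)   # ordinary Newton step
```

Here $f(\lambda_0) \approx 1$ and $f'(\lambda_0) \approx .0089$, so the tangent line crosses zero near $-110$, far outside the interval (2, 3) that contains the root being sought.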


Newton’s method can be modified to deal with this situation as follows.
Since $f(\lambda)$ is not well approximated by a straight line $l(\lambda)$, we
approximate it by another simple function $h(\lambda)$. There is nothing
special about straight lines; any approximation $h(\lambda)$ that is both easy
to compute and has zeros that are easy to compute can be used in place of
$l(\lambda)$ in Newton’s method. Since $f(\lambda)$ has poles at $d_i$ and
$d_{i+1}$ and these poles dominate the behavior of $f(\lambda)$ near them, it
is natural when seeking the root in $(d_i, d_{i+1})$ to choose $h(\lambda)$ to
have these poles as well, i.e.,

$$h(\lambda) = \frac{c_1}{d_i - \lambda} + \frac{c_2}{d_{i+1} - \lambda} + c_3.$$


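There are several ways to choose the constants $c_1$, $c_2$, and $c_3$ so that
$h(\lambda)$ approximates $f(\lambda)$; we present a slightly simplified
version of the one used in the LAPACK routine slaed4 [170, 44]. Assuming for a
moment that we have chosen $c_1$, $c_2$, and $c_3$, we can easily solve
$h(\lambda) = 0$ for $\lambda$ by solving the equivalent quadratic equation

$$c_1 (d_{i+1} - \lambda) + c_2 (d_i - \lambda) + c_3 (d_i - \lambda)(d_{i+1} - \lambda) = 0.$$

Given the approximate zero $\lambda_j$, here is how we compute $c_1$, $c_2$,
and $c_3$ so that for $\lambda$ near $\lambda_j$

$$\frac{c_1}{d_i - \lambda} + \frac{c_2}{d_{i+1} - \lambda} + c_3 = h(\lambda) \approx f(\lambda).$$

Write

$$f(\lambda) = 1 + \rho \sum_{k=1}^{i} \frac{u_k^2}{d_k - \lambda} + \rho \sum_{k=i+1}^{n} \frac{u_k^2}{d_k - \lambda} \equiv 1 + \psi_1(\lambda) + \psi_2(\lambda).$$

For $\lambda \in (d_i, d_{i+1})$, $\psi_1(\lambda)$ is a sum of negative terms
and $\psi_2(\lambda)$ is a sum of positive terms. Thus both $\psi_1(\lambda)$
and $\psi_2(\lambda)$ can be computed accurately, whereas adding them together
would likely result in cancellation and loss of relative accuracy in the sum.
We now choose $c_1$ and $\hat c_1$ so that

$$h_1(\lambda) \equiv \hat c_1 + \frac{c_1}{d_i - \lambda} \quad \text{satisfies}$$

$$h_1(\lambda_j) = \psi_1(\lambda_j) \quad \text{and} \quad h_1'(\lambda_j) = \psi_1'(\lambda_j). \qquad (5.15)$$

This means that the graph of $h_1(\lambda)$ (a hyperbola) is tangent to the
graph of $\psi_1(\lambda)$ at $\lambda = \lambda_j$. The two conditions in
equation (5.15) are the usual conditions in Newton’s method, except instead of
using a straight line approximation, we use a hyperbola. It is easy to verify
that $c_1 = \psi_1'(\lambda_j)(d_i - \lambda_j)^2$ and
$\hat c_1 = \psi_1(\lambda_j) - \psi_1'(\lambda_j)(d_i - \lambda_j)$. (See
Question 5.17.)

Similarly, we choose $c_2$ and $\hat c_2$ so that

$$h_2(\lambda) \equiv \hat c_2 + \frac{c_2}{d_{i+1} - \lambda} \quad \text{satisfies}$$

$$h_2(\lambda_j) = \psi_2(\lambda_j) \quad \text{and} \quad h_2'(\lambda_j) = \psi_2'(\lambda_j). \qquad (5.16)$$

Finally, we set

$$\begin{aligned}
h(\lambda) &= 1 + h_1(\lambda) + h_2(\lambda) \\
&= (1 + \hat c_1 + \hat c_2) + \frac{c_1}{d_i - \lambda} + \frac{c_2}{d_{i+1} - \lambda} \\
&\equiv c_3 + \frac{c_1}{d_i - \lambda} + \frac{c_2}{d_{i+1} - \lambda}.
\end{aligned}$$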


EXAMPLE 5.8. In the example of Figure 5.3, if we start with
$\lambda_0 = 2.5$, then

$$h(\lambda) = \frac{1.1111 \cdot 10^{-3}}{2 - \lambda} + \frac{1.1111 \cdot 10^{-3}}{3 - \lambda} + 1,$$

and its graph is visually indistinguishable from the graph of $f(\lambda)$ in
the right-hand figure. Solving $h(\lambda_1) = 0$, we get $\lambda_1 = 2.0011$,
which is accurate to 4 decimal digits. Continuing, $\lambda_2$ is accurate to
11 digits, and $\lambda_3$ is accurate to all 16 digits. ◊

The algorithm used in LAPACK routine slaed4 is a slight variation on
the one described here (the one here is called the Middle Way in [170]). The
LAPACK routine averages two to three iterations per eigenvalue to converge
to full machine precision, and never took more than seven steps in extensive
numerical tests.
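
The hyperbola-based update described above can be sketched in Python as follows. This is our simplified illustration of the scheme in the text, not the slaed4 code: indices are 0-based (so `i = 1` selects the interval (2, 3) for d = 1, 2, 3, 4), `u2` holds the squares $u_k^2$, and the quadratic is solved with the textbook formula without the safeguards a library routine would need.

```python
import math

def middle_way_step(d, u2, i, lam, rho=1.0):
    """One hyperbola-based update for the root in (d[i], d[i+1]).
    Returns ((c1, c2, c3), new_lam)."""
    n = len(d)
    psi1 = rho * sum(u2[k] / (d[k] - lam) for k in range(i + 1))
    dpsi1 = rho * sum(u2[k] / (d[k] - lam) ** 2 for k in range(i + 1))
    psi2 = rho * sum(u2[k] / (d[k] - lam) for k in range(i + 1, n))
    dpsi2 = rho * sum(u2[k] / (d[k] - lam) ** 2 for k in range(i + 1, n))
    c1 = dpsi1 * (d[i] - lam) ** 2                     # from (5.15)
    c2 = dpsi2 * (d[i + 1] - lam) ** 2                 # from (5.16)
    c3 = 1.0 + (psi1 - dpsi1 * (d[i] - lam)) + (psi2 - dpsi2 * (d[i + 1] - lam))
    # c1*(d[i+1]-x) + c2*(d[i]-x) + c3*(d[i]-x)*(d[i+1]-x) = 0 as A x^2 + B x + C
    A = c3
    B = -(c1 + c2 + c3 * (d[i] + d[i + 1]))
    C = c1 * d[i + 1] + c2 * d[i] + c3 * d[i] * d[i + 1]
    if A == 0.0:
        return (c1, c2, c3), -C / B
    disc = math.sqrt(B * B - 4.0 * A * C)
    candidates = [(-B - disc) / (2.0 * A), (-B + disc) / (2.0 * A)]
    inside = [r for r in candidates if d[i] < r < d[i + 1]]
    return (c1, c2, c3), inside[0]

d = [1.0, 2.0, 3.0, 4.0]
u2 = [0.001] * 4
(c1, c2, c3), lam1 = middle_way_step(d, u2, 1, 2.5)    # first step from 2.5
lam = lam1
for _ in range(4):
    _, lam = middle_way_step(d, u2, 1, lam)
```

Starting from 2.5 the first step reproduces the coefficients of $h(\lambda)$ quoted in Example 5.8, and a few further steps drive $f(\lambda)$ to roundoff level.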


Computing the Eigenvectors Stably 


Once we have solved the secular equation to get the eigenvalues $\alpha_i$ of
$D + \rho u u^T$, Lemma 5.2 provides a simple formula for the eigenvectors:
$(D - \alpha_i I)^{-1} u$. Unfortunately, the formula can be unstable
[58, 88, 232], in particular when two eigenvalues $\alpha_i$ and
$\alpha_{i+1}$ are very close together. Intuitively, the problem is that
$(D - \alpha_i I)^{-1} u$ and $(D - \alpha_{i+1} I)^{-1} u$ are “very close”
formulas yet are supposed to yield orthogonal eigenvectors. More precisely,
when $\alpha_i$ and $\alpha_{i+1}$ are very close, they must also be close to
the $d_i$ between them. Therefore, there is a great deal of cancellation,
either when evaluating $d_i - \alpha_i$ and $d_i - \alpha_{i+1}$ or when
evaluating the secular equation during Newton iteration. Either way,
$d_i - \alpha_i$ and $d_i - \alpha_{i+1}$ may contain large relative errors, so
the computed eigenvectors $(D - \alpha_i I)^{-1} u$ and
$(D - \alpha_{i+1} I)^{-1} u$ are quite inaccurate and far from orthogonal.


Early attempts to address this problem [88, 232] used double precision 
arithmetic (when the input data was single precision) to solve the secular 
equation to high accuracy so that d; — a; and di — a;41 could be computed to 
high accuracy. But when the input data is already in double precision, this 
means quadruple precision would be needed, and this is not available in many 
machines and languages, or at least not cheaply. As described in section 1.5, 
it is possible to simulate quadruple precision using double precision [232, 202]. 
This can be done portably and relatively efficiently, as long as the underlying 
floating point arithmetic rounds sufficiently accurately. In particular, these 
simulations require that $fl(a \pm b) = (a \pm b)(1 + \delta)$ with $|\delta| = O(\varepsilon)$, barring 
overflow or underflow (see section 1.5 and Question 1.18). Unfortunately, the 
Cray 2, YMP, and C90 do not round accurately enough to use these efficient 
algorithms. 


Finally, an alternative formula was found that makes simulating high
precision arithmetic unnecessary. It is based on the following theorem of
Löwner [127, 177].


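THEOREM 5.10. Löwner. Let $D = \mathrm{diag}(d_1, \ldots, d_n)$ be diagonal
with $d_n < \cdots < d_1$. Let $\alpha_n < \cdots < \alpha_1$ be given,
satisfying the interlacing property

$$d_n < \alpha_n < \cdots < d_{i+1} < \alpha_{i+1} < d_i < \alpha_i < \cdots < d_1 < \alpha_1.$$

Then there is a vector $\hat u$ such that the $\alpha_i$ are the exact
eigenvalues of $\hat D \equiv D + \hat u \hat u^T$. The entries of $\hat u$ are
given by

$$|\hat u_i| = \left[ \frac{\prod_{j=1}^n (\alpha_j - d_i)}{\prod_{j=1, j \neq i}^n (d_j - d_i)} \right]^{1/2}.$$

Proof. The characteristic polynomial of $\hat D$ can be written both as
$\det(\hat D - \lambda I) = \prod_{j=1}^n (\alpha_j - \lambda)$ and (using
equations (5.13) and (5.14)) as

$$\begin{aligned}
\det(\hat D - \lambda I) &= \left[ \prod_{j=1}^n (d_j - \lambda) \right] \cdot \left( 1 + \sum_{j=1}^n \frac{\hat u_j^2}{d_j - \lambda} \right) \\
&= \left[ \prod_{j=1}^n (d_j - \lambda) \right] \cdot \left( 1 + \sum_{j=1, j \neq i}^n \frac{\hat u_j^2}{d_j - \lambda} \right) + \left[ \prod_{j=1, j \neq i}^n (d_j - \lambda) \right] \cdot \hat u_i^2.
\end{aligned}$$

Setting $\lambda = d_i$ and equating both expressions for
$\det(\hat D - \lambda I)$ yields

$$\prod_{j=1}^n (\alpha_j - d_i) = \hat u_i^2 \cdot \prod_{j=1, j \neq i}^n (d_j - d_i)$$

or

$$\hat u_i^2 = \frac{\prod_{j=1}^n (\alpha_j - d_i)}{\prod_{j=1, j \neq i}^n (d_j - d_i)}.$$

Using the interlacing property, we can show that the fraction on the right is
positive, so we can take its square root to get the desired expression for
$\hat u_i$.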

Here is the stable algorithm for computing the eigenvalues and eigenvectors
(where we assume for simplicity of presentation that $\rho = 1$).

ALGORITHM 5.3. Compute the eigenvalues and eigenvectors of $D + u u^T$.

    Solve the secular equation $1 + \sum_{i=1}^n \frac{u_i^2}{d_i - \lambda} = 0$
    to get the eigenvalues $\alpha_i$ of $D + u u^T$.

    Use Löwner’s theorem to compute $\hat u$ so that the $\alpha_i$ are “exact”
    eigenvalues of $D + \hat u \hat u^T$.

    Use Lemma 5.2 to compute the eigenvectors of $D + \hat u \hat u^T$.

Here is a sketch of why this algorithm is numerically stable. By analyzing
the stopping criterion in the secular equation solver,^18 one can show that
$\|u u^T - \hat u \hat u^T\|_2 \le O(\varepsilon)(\|D\|_2 + \|u u^T\|_2)$; this
means that $D + u u^T$ and $D + \hat u \hat u^T$ are so close together that the
eigenvalues and eigenvectors of $D + \hat u \hat u^T$ are stable approximations
of the eigenvalues and eigenvectors of $D + u u^T$. Next note that the formula
for $\hat u_i$ in Löwner’s theorem requires only differences of floating point
numbers $d_j - d_i$ and $\alpha_j - d_i$, products and quotients of these
differences, and a square root. Provided that the floating point arithmetic is
accurate enough that $fl(a \odot b) = (a \odot b)(1 + \delta)$ for all
$\odot \in \{+, -, \times, /\}$ and $\mathrm{sqrt}(a) = \sqrt{a} \cdot (1 + \delta)$
with $|\delta| = O(\varepsilon)$, this formula can be evaluated to high
relative accuracy. In particular, we can easily show that

$$fl\left( \left[ \frac{\prod_{j=1}^n (\alpha_j - d_i)}{\prod_{j=1, j \neq i}^n (d_j - d_i)} \right]^{1/2} \right) = (1 + (4n - 2)\delta) \cdot \left[ \frac{\prod_{j=1}^n (\alpha_j - d_i)}{\prod_{j=1, j \neq i}^n (d_j - d_i)} \right]^{1/2}$$

with $|\delta| = O(\varepsilon)$, barring overflow or underflow. Similarly, the
formula in Lemma 5.2 can also be evaluated to high relative accuracy, so we can
compute the eigenvectors of $D + \hat u \hat u^T$ to high relative accuracy. In
particular, they are very accurately orthogonal.

In summary, provided the floating point arithmetic is accurate enough,
Algorithm 5.3 computes very accurate eigenvalues and eigenvectors of a matrix
$D + \hat u \hat u^T$ that differs only slightly from the original matrix
$D + u u^T$. This means that it is numerically stable.
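
The last two steps of Algorithm 5.3 can be sketched in Python. This is an illustration on made-up 2-by-2 data with eigenvalues known in closed form; the function names are ours, and copying the sign of $\hat u_i$ from $u_i$ is our assumption, since Löwner's formula determines only $|\hat u_i|$.

```python
import math

def lowner_u_hat(d, alpha, u):
    """|u_hat_i| from Lowner's formula, with the sign copied from u_i."""
    n = len(d)
    u_hat = []
    for i in range(n):
        num = 1.0
        for j in range(n):
            num *= alpha[j] - d[i]       # only differences and products
        den = 1.0
        for j in range(n):
            if j != i:
                den *= d[j] - d[i]
        u_hat.append(math.copysign(math.sqrt(num / den), u[i]))
    return u_hat

def eigenvectors(d, alpha, u_hat):
    """Normalized vectors (D - alpha_j I)^{-1} u_hat, one per eigenvalue."""
    vecs = []
    for a in alpha:
        w = [ui / (di - a) for di, ui in zip(d, u_hat)]
        nrm = math.sqrt(sum(x * x for x in w))
        vecs.append([x / nrm for x in w])
    return vecs

d = [3.0, 1.0]                                       # d_2 < d_1
u = [1.0, 1.0]
alpha = [3.0 + math.sqrt(2.0), 3.0 - math.sqrt(2.0)]  # eigenvalues of D + u u^T
u_hat = lowner_u_hat(d, alpha, u)
V = eigenvectors(d, alpha, u_hat)
```

With exact eigenvalues as input, $\hat u$ coincides with u up to roundoff; the point of the algorithm is that with slightly inexact $\alpha_i$ the vectors built from $\hat u$ remain orthogonal.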

The reader should note that our need for sufficiently accurate floating point 
arithmetic is precisely what prevented the simulation of quadruple precision 
proposed in [232, 202] from working on some Crays. So we have not yet 
succeeded in providing an algorithm that works reliably on these machines. 
One more trick is necessary: The only operations that fail to be accurate 
enough on some Crays are addition and subtraction, because of the lack of a so- 
called guard digit in the floating point hardware. This means that the bottom- 
most bit of an operand may be treated as 0 during addition or subtraction, even 
if it is 1. If most higher-order bits cancel, this “lost bit” becomes significant. 
For example, subtracting 1 from the next smaller floating point number, in 
which case all leading bits cancel, results in a number twice too large on the 
Cray C90 and in 0 on the Cray 2. But if the bottom bit is already 0, no 
harm is done. So the trick is to deliberately set all the bottom bits of the
$d_i$ to 0 before applying Löwner’s theorem or Lemma 5.2 in Algorithm 5.3. This
modification causes only a small relative change in the $d_i$ and $\alpha_i$,
and so the algorithm is still stable.^19

This algorithm is described in more detail in [127, 129] and implemented
in LAPACK routine slaed3.

^18 In more detail, the secular equation solver must solve for
$\alpha_i - d_i$ or $d_{i+1} - \alpha_i$ (whichever is smaller), not
$\alpha_i$, to attain this accuracy.

^19 To set the bottom bit of a floating point number $\beta$ to 0 on a Cray,
one can show that it is necessary only to set
$\beta := (\beta + \beta) - \beta$. This inexpensive computation does not
change $\beta$ at all on a machine with accurate binary arithmetic (barring
overflow, which is easily avoided).


Accelerating Divide-and-Conquer using the FMM 


The FMM [122] was originally invented for the completely different problem of 
computing the mutual forces on n electrically charged particles or the mutual 
gravitational forces on n masses. We only sketch how these problems are 
related to finding eigenvalues and eigenvectors, leaving details to [129]. 

Let $d_1$ through $d_n$ be the three-dimensional position vectors of $n$ particles
with charges $z_i \cdot u_i$. Let $\alpha_1$ through $\alpha_n$ be the position vectors
of $n$ other particles with unit positive charges. Then the inverse-square law tells
us that the force on the particle at $\alpha_j$ due to the particles at $d_1$ through
$d_n$ is proportional to

\[ f_j = \sum_{i=1}^n \frac{z_i u_i (d_i - \alpha_j)}{\|d_i - \alpha_j\|_2^3}. \]

If we are modeling electrostatics in two dimensions instead of three, the force
law changes to the inverse-first-power law$^{20}$

\[ f_j = \sum_{i=1}^n \frac{z_i u_i (d_i - \alpha_j)}{\|d_i - \alpha_j\|_2^2}. \]

Since $d_i$ and $\alpha_j$ are vectors in $\mathbb{R}^2$, we can also consider them to
be complex variables. In this case

\[ f_j = \sum_{i=1}^n \frac{z_i u_i}{\bar{d}_i - \bar{\alpha}_j}, \]

where $\bar{d}_i$ and $\bar{\alpha}_j$ are the complex conjugates of $d_i$ and
$\alpha_j$, respectively. If $d_i$ and $\alpha_j$ happen to be real numbers, this
simplifies further to

\[ f_j = \sum_{i=1}^n \frac{z_i u_i}{d_i - \alpha_j}. \]


Now consider performing a matrix-vector multiplication $f^T = z^T Q'$, where
$Q'$ is the eigenvector matrix of $D + \rho uu^T$. From Lemma 5.2,
$Q'_{ij} = u_i s_j/(d_i - \alpha_j)$,


But on a Cray it sets the bottom bit to 0. The reader familiar with Cray arithmetic is
invited to prove this. The only remaining difficulty is preventing an optimizing compiler
from removing this line of code entirely, which some overzealous optimizers might do; this is
accomplished (for the current generation of compilers) by computing $(\beta + \beta)$ with a
function call to a function stored in a separate file from the main routine. We hope that by
the time compilers become clever enough to optimize even this situation, Cray arithmetic will
have died out.

$^{20}$Technically, this means the potential function satisfies Poisson's equation in two
space coordinates rather than three.

where $s_j$ is a scale factor chosen so that column $j$ is a unit vector. Then the
$j$th entry of $f^T = z^T Q'$ is

\[ f_j = \sum_{i=1}^n z_i Q'_{ij} = s_j \sum_{i=1}^n \frac{z_i u_i}{d_i - \alpha_j}, \]

which is the same sum as for the electrostatic force, except for the scale
factor $s_j$. Thus, the most expensive part of the divide-and-conquer algorithm,
the matrix multiplication in the last line of Algorithm 5.2, is equivalent to
evaluating electrostatic forces.


Evaluating this sum for $j = 1, \ldots, n$ appears to require $O(n^2)$ flops. The
FMM and others like it [122, 23] can be used to approximately (but very
accurately) evaluate this sum in $O(n \cdot \log n)$ time (or even $O(n)$ time) instead.
(See the lectures on “Fast Hierarchical Methods for the N-body Problem” at
PARALLEL_HOMEPAGE for details.)

But this idea alone is not enough to reduce the cost of divide-and-conquer
to $O(n \cdot \log^p n)$. After all, the output eigenvector matrix $Q$ has $n^2$ entries,
which appears to mean that the complexity should be at least $n^2$. So we
must represent $Q$ using fewer than $n^2$ independent numbers. This is possible,
because an $n$-by-$n$ tridiagonal matrix has only $2n - 1$ “degrees of freedom”
(the diagonal and superdiagonal entries), of which $n$ can be represented by the
eigenvalues, leaving $n - 1$ for the orthogonal matrix $Q$. In other words, not every
orthogonal matrix can be the eigenvector matrix of a symmetric tridiagonal $T$;
only an $(n - 1)$-dimensional subset of the entire $(n(n - 1)/2)$-dimensional set
of orthogonal matrices can be such eigenvector matrices.

We will represent $Q$ using the divide-and-conquer tree computed by Algorithm 5.2.
Rather than accumulating
$Q = \begin{bmatrix} Q_1 & 0 \\ 0 & Q_2 \end{bmatrix} \cdot Q'$, we will store all the
$Q'$ matrices, one at each node in the tree. And we will not store $Q'$ explicitly
but rather just store $D$, $\rho$, $u$, and the eigenvalues $\alpha_i$ of $D + \rho uu^T$.
We can do this since this is all we need to use the FMM to multiply by $Q'$. This
reduces the storage needed for $Q$ from $n^2$ to $O(n \cdot \log n)$. Thus, the output of
the algorithm is a “factored” form of $Q$ consisting of all the $Q'$ factors at the nodes
of the tree. This is an adequate representation of $Q$, because we can use the
FMM to multiply any vector by $Q$ in $O(n \cdot \log^p n)$ time.


5.3.4. Bisection and Inverse Iteration 


The bisection algorithm exploits Sylvester's inertia theorem (Theorem 5.3)
to find only those $k$ eigenvalues that one wants, at cost $O(nk)$. Recall that
Inertia$(A) = (\nu, \zeta, \pi)$, where $\nu$, $\zeta$, and $\pi$ are the number of
negative, zero, and positive eigenvalues of $A$, respectively. Suppose that $X$ is
nonsingular; Sylvester's inertia theorem asserts that Inertia$(A)$ = Inertia$(X^T A X)$.

Now suppose that one uses Gaussian elimination to factorize $A - zI = LDL^T$,
where $L$ is nonsingular and $D$ diagonal. Then Inertia$(A - zI)$ = Inertia$(D)$.
Since $D$ is diagonal, its inertia is trivial to compute. (In what follows, we use
notation such as “# $d_{ii} < 0$” to mean “the number of values of $d_{ii}$ that are
less than zero.”)

\[
\begin{aligned}
\mathrm{Inertia}(A - zI) &= (\#\, d_{ii} < 0,\ \#\, d_{ii} = 0,\ \#\, d_{ii} > 0) \\
&= (\#\text{ negative eigenvalues of } A - zI,\ \#\text{ zero eigenvalues of } A - zI,\ \#\text{ positive eigenvalues of } A - zI) \\
&= (\#\text{ eigenvalues of } A < z,\ \#\text{ eigenvalues of } A = z,\ \#\text{ eigenvalues of } A > z).
\end{aligned}
\]

Suppose $z_1 < z_2$ and we compute Inertia$(A - z_1 I)$ and Inertia$(A - z_2 I)$.
Then the number of eigenvalues in the interval $[z_1, z_2)$ equals (# eigenvalues
of $A < z_2$) $-$ (# eigenvalues of $A < z_1$).

To make this observation into an algorithm, define 


Negcount(A,z) = # eigenvalues of A < z. 


ALGORITHM 5.4. Bisection: Find all eigenvalues of A inside [a, b) to a given
error tolerance tol:

    n_a = Negcount(A, a)
    n_b = Negcount(A, b)
    if n_a = n_b, quit ... because there are no eigenvalues in [a, b)
    put [a, n_a, b, n_b] onto Worklist
    /* Worklist contains a list of intervals [a, b) containing
       eigenvalues n - n_b + 1 through n - n_a (numbered in decreasing
       order), which the algorithm will repeatedly bisect until they
       are narrower than tol. */
    while Worklist is not empty do
        remove [low, n_low, up, n_up] from Worklist
        if (up - low < tol) then
            print “there are n_up - n_low eigenvalues in [low, up)”
        else
            mid = (low + up)/2
            n_mid = Negcount(A, mid)
            if n_mid > n_low then ... there are eigenvalues in [low, mid)
                put [low, n_low, mid, n_mid] onto Worklist
            end if
            if n_up > n_mid then ... there are eigenvalues in [mid, up)
                put [mid, n_mid, up, n_up] onto Worklist
            end if
        end if
    end while


If $\alpha_1 \ge \cdots \ge \alpha_n$ are eigenvalues, the same idea can be used to
compute $\alpha_j$ for $j = j_0, j_0 + 1, \ldots, j_1$. This is because we know
$\alpha_{n - n_{up} + 1}$ through $\alpha_{n - n_{low}}$ lie in the interval [low, up).
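The worklist logic of Algorithm 5.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bisection`, the callable argument `negcount`, and the diagonal test matrix are hypothetical choices made for the example, not part of the text, and a first-in-first-out queue is just one possible way to organize the Worklist.

```python
from collections import deque

def bisection(negcount, a, b, tol):
    """Sketch of Algorithm 5.4.

    negcount(z) must return the number of eigenvalues less than z.
    Returns a list of (low, up, count): each interval [low, up) is
    narrower than tol and contains `count` eigenvalues.
    """
    n_a, n_b = negcount(a), negcount(b)
    found = []
    if n_a == n_b:                      # no eigenvalues in [a, b)
        return found
    worklist = deque([(a, n_a, b, n_b)])
    while worklist:
        low, n_low, up, n_up = worklist.popleft()
        if up - low < tol:
            found.append((low, up, n_up - n_low))
        else:
            mid = (low + up) / 2
            n_mid = negcount(mid)
            if n_mid > n_low:           # eigenvalues in [low, mid)
                worklist.append((low, n_low, mid, n_mid))
            if n_up > n_mid:            # eigenvalues in [mid, up)
                worklist.append((mid, n_mid, up, n_up))
    return found

# Hypothetical test problem: A = diag(-1, 0.5, 0.5, 3), for which
# counting the eigenvalues below z is trivial.
eigs = [-1.0, 0.5, 0.5, 3.0]
negcount = lambda z: sum(1 for lam in eigs if lam < z)
intervals = bisection(negcount, -2.0, 4.0, 1e-6)
```

With this test matrix the double eigenvalue 0.5 shows up as a single narrow interval with a count of 2, which is how Bisection reports multiple or tightly clustered eigenvalues.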
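If $A$ were dense, we could implement Negcount$(A, z)$ by doing symmetric
Gaussian elimination with pivoting as described in section 2.7.2. But this
would cost $O(n^3)$ flops per evaluation and so not be cost effective. On the other
hand, Negcount$(A, z)$ is quite simple to compute for tridiagonal $A$, provided
that we do not pivot:

\[
A - zI =
\begin{bmatrix}
a_1 - z & b_1 & & \\
b_1 & a_2 - z & \ddots & \\
 & \ddots & \ddots & b_{n-1} \\
 & & b_{n-1} & a_n - z
\end{bmatrix}
= LDL^T =
\begin{bmatrix}
1 & & & \\
l_1 & 1 & & \\
 & \ddots & \ddots & \\
 & & l_{n-1} & 1
\end{bmatrix}
\begin{bmatrix}
d_1 & & \\
 & \ddots & \\
 & & d_n
\end{bmatrix}
\begin{bmatrix}
1 & l_1 & & \\
 & \ddots & \ddots & \\
 & & 1 & l_{n-1} \\
 & & & 1
\end{bmatrix},
\]

so $a_1 - z = d_1$, $d_1 l_1 = b_1$, and thereafter $l_{i-1}^2 d_{i-1} + d_i = a_i - z$,
$d_i l_i = b_i$. Substituting $l_i = b_i/d_i$ into $l_{i-1}^2 d_{i-1} + d_i = a_i - z$
yields the simple recurrence

\[ d_i = (a_i - z) - \frac{b_{i-1}^2}{d_{i-1}}. \qquad (5.17) \]

Notice that we are not pivoting, so you might think that this is dangerously
unstable, especially when $d_{i-1}$ is small. In fact, since it is tridiagonal, it can
be shown to be very stable [72, 73, 154].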


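LEMMA 5.3. The $d_i$ computed in floating point arithmetic, using equation (5.17),
have the same signs (and so compute the same Inertia) as the $\hat{d}_i$ computed
exactly from $\hat{A}$, where $\hat{A}$ is very close to $A$:

\[ (\hat{A})_{ii} \equiv \hat{a}_i = a_i \quad\text{and}\quad (\hat{A})_{i,i+1} \equiv \hat{b}_i = b_i(1 + \varepsilon_i), \quad\text{where } |\varepsilon_i| \le 2.5\varepsilon + O(\varepsilon^2). \]

Proof. Let $\tilde{d}_i$ denote the quantities computed using equation (5.17) including
rounding errors:

\[
\tilde{d}_i = \left[ (a_i - z)(1 + \varepsilon_{-,1,i}) - \frac{b_{i-1}^2 (1 + \varepsilon_{*,i})}{\tilde{d}_{i-1}} (1 + \varepsilon_{/,i}) \right] (1 + \varepsilon_{-,2,i}), \qquad (5.18)
\]

where all the $\varepsilon$'s are bounded by machine roundoff $\varepsilon$ in magnitude, and
their subscripts indicate which floating point operation they come from (for example,
$\varepsilon_{-,2,i}$ is from the second subtraction when computing $\tilde{d}_i$). Define
the new variables

\[
\hat{d}_i = \frac{\tilde{d}_i}{(1 + \varepsilon_{-,1,i})(1 + \varepsilon_{-,2,i})}, \qquad (5.19)
\]
\[
\hat{b}_{i-1} = b_{i-1} \left[ \frac{(1 + \varepsilon_{*,i})(1 + \varepsilon_{/,i})}{(1 + \varepsilon_{-,1,i})(1 + \varepsilon_{-,1,i-1})(1 + \varepsilon_{-,2,i-1})} \right]^{1/2} \equiv b_{i-1}(1 + \varepsilon_i).
\]

Note that $\hat{d}_i$ and $\tilde{d}_i$ have the same signs, and
$|\varepsilon_i| \le 2.5\varepsilon + O(\varepsilon^2)$. Substituting (5.19) into (5.18) yields

\[ \hat{d}_i = a_i - z - \frac{\hat{b}_{i-1}^2}{\hat{d}_{i-1}}, \]

completing the proof.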

A complete analysis must take the possibility of overflow or underflow into
account. Indeed, using the exception handling facilities of IEEE arithmetic,
one can safely compute even when some $d_{i-1}$ is exactly zero! For in this case
$d_i = -\infty$, $d_{i+1} = a_{i+1} - z$, and the computation continues unexceptionally
[72, 80].

The cost of a single call to Negcount on a tridiagonal matrix is 4n flops. 
Therefore the overall cost to find k eigenvalues is O(kn). This is implemented 
in LAPACK routine sstebz. 
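As an illustration, recurrence (5.17) translates into a short Negcount routine for a symmetric tridiagonal matrix. The sketch below is not the sstebz implementation; the name `negcount_tridiag` and the small test matrix are assumptions made for the example. Because Python raises an exception on division by zero instead of returning an IEEE infinity, the zero-pivot case is emulated explicitly with `math.copysign`.

```python
import math

def negcount_tridiag(a, b, z):
    """Number of eigenvalues less than z of the symmetric tridiagonal
    matrix with diagonal a[0..n-1] and offdiagonal b[0..n-2], counted
    as the number of negative d_i in recurrence (5.17)."""
    count = 0
    d = 0.0
    for i in range(len(a)):
        if i == 0:
            d = a[0] - z
        else:
            bb = b[i - 1] * b[i - 1]
            if bb == 0.0:
                ratio = 0.0
            elif d == 0.0:
                # Emulate IEEE arithmetic: b^2 / (+-0) = +-infinity.
                ratio = math.copysign(math.inf, d)
            else:
                ratio = bb / d
            d = (a[i] - z) - ratio
        if d < 0:
            count += 1
    return count

# Hypothetical test matrix: diagonal 2, offdiagonal -1, n = 3.
# Its eigenvalues are 2 - sqrt(2), 2, and 2 + sqrt(2).
a = [2.0, 2.0, 2.0]
b = [-1.0, -1.0]
counts = [negcount_tridiag(a, b, z) for z in (0.0, 1.0, 2.0, 3.0, 4.0)]
```

The shifts z = 1, 2, and 3 each produce an exactly zero pivot for this matrix, so they exercise the infinity branch; the counts still agree with the true eigenvalue positions.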

Note that Bisection converges linearly, with one more bit of accuracy for 
each bisection of an interval. There are many ways to accelerate convergence, 
using algorithms like Newton’s method and its relatives, to find zeros of the 
characteristic polynomial (which may be computed by multiplying all the $d_i$'s
together) [171, 172, 173, 174, 176, 267].

To compute eigenvectors once we have computed (selected) eigenvalues, we 
can use inverse iteration (Algorithm 4.2); this is available in LAPACK routine 
sstein. Since we can use accurate eigenvalues as shifts, convergence usually 
takes one or two iterations. In this case the cost is O(n) flops per eigenvector, 
since one step of inverse iteration requires us only to solve a tridiagonal system 
of equations (see section 2.7.3). When several computed eigenvalues
$\hat{\alpha}_i, \ldots, \hat{\alpha}_j$ are close together, their corresponding computed
eigenvectors $\hat{q}_i, \ldots, \hat{q}_j$ may not be orthogonal. In this case the
algorithm reorthogonalizes the computed eigenvectors, computing the QR decomposition
$[\hat{q}_i, \ldots, \hat{q}_j] = QR$ and replacing each $\hat{q}_k$ with the $k$th column
of $Q$; this guarantees that the $\hat{q}_k$ are orthonormal.
This QR decomposition is usually computed using the MGS orthogonalization
process (Algorithm 3.1); i.e., each computed eigenvector has any components
in the directions of previously computed eigenvectors explicitly subtracted out.
When the cluster size $k = j - i + 1$ is small, the cost $O(k^2 n)$ of this
reorthogonalization is small, so in principle all the eigenvalues and all the
eigenvectors could be computed by Bisection followed by inverse iteration in just
$O(n^2)$ flops total. This is much faster than the $O(n^3)$ cost of QR iteration or
divide-and-conquer (in the worst case). The obstacle to obtaining this speedup reliably
is that if the cluster size $k = j - i + 1$ is large, i.e., a sizable fraction of $n$,
then the total cost rises to $O(n^3)$ again. Worse, there is no guarantee that the
computed eigenvectors are accurate or orthogonal. (The trouble is that after
reorthogonalizing a set of nearly dependent $\hat{q}_k$, cancellation may mean some
computed eigenvectors consist of little more than roundoff errors.)
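The cancellation described above is easy to see in a toy example. The sketch below applies modified Gram-Schmidt to two nearly parallel vectors; the helper name `mgs` and the two input vectors are hypothetical, chosen only to make the effect visible, and this is not the reorthogonalization code used in sstein.

```python
import math

def mgs(vectors):
    """Modified Gram-Schmidt: orthonormalize a list of vectors in order."""
    Q = []
    for v in vectors:
        w = list(v)
        for q in Q:
            # Subtract the component of w along the already accepted q.
            proj = sum(qi * wi for qi, wi in zip(q, w))
            w = [wi - proj * qi for qi, wi in zip(q, w)]
        norm = math.sqrt(sum(x * x for x in w))
        Q.append([x / norm for x in w])
    return Q

# Two hypothetical computed eigenvectors that agree to about 8 digits.
q_hat = [[1.0, 0.0, 0.0], [1.0, 1e-8, 0.0]]
Q = mgs(q_hat)
dot = sum(x * y for x, y in zip(Q[0], Q[1]))
```

The output is orthonormal, but the second vector is built entirely from the 1e-8 component of its input. Any error in that component, which in a real computation is dominated by roundoff, is magnified by a factor of about 1e8 when the vector is renormalized.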

There has been recent progress on this problem, however [103, 199, 201], 
and it now appears possible that inverse iteration may be “repaired” to provide 
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accurate, orthogonal eigenvectors without spending more than O(n) flops per 
eigenvector to reorthogonalize. This would make Bisection and “repaired” 
inverse iteration the algorithm of choice in all cases, no matter how many 
eigenvalues and eigenvectors are desired. We look forward to describing this 
algorithm in a future edition. 


Note that Bisection and inverse iteration are “embarrassingly parallel,” 
since each eigenvalue and later eigenvector may be found independently of 
the others. (This presumes that inverse iteration has been repaired so that 
reorthogonalization with many other eigenvectors is no longer necessary.) This 
makes these algorithms very attractive for parallel computers [75]. 


5.3.5. Jacobi’s Method 


Jacobi’s method does not start by reducing A to tridiagonal form as do the
previous methods but instead works on the original dense matrix. Jacobi’s 
method is usually much slower than the previous methods and remains of 
interest only because it can sometimes compute tiny eigenvalues and their 
eigenvectors with much higher accuracy than the previous methods and can be 
easily parallelized. Here we describe only the basic implementation of Jacobi’s 
method, and defer the discussion of high accuracy to section 5.4.3. 


Given a symmetric matrix $A = A_0$, Jacobi's method produces a sequence
$A_1, A_2, \ldots$ of orthogonally similar matrices, which eventually converge to a
diagonal matrix with the eigenvalues on the diagonal. $A_{i+1}$ is obtained from
$A_i$ by the formula $A_{i+1} = J_i^T A_i J_i$, where $J_i$ is an orthogonal matrix
called a Jacobi rotation. Thus

\[
\begin{aligned}
A_m &= J_{m-1}^T A_{m-1} J_{m-1} \\
    &= J_{m-1}^T J_{m-2}^T A_{m-2} J_{m-2} J_{m-1} = \cdots \\
    &= J_{m-1}^T \cdots J_0^T A_0 J_0 \cdots J_{m-1} \\
    &\equiv J^T A J.
\end{aligned}
\]

If we choose each $J_i$ appropriately, $A_m$ approaches a diagonal matrix $\Lambda$ for
large $m$. Thus we can write $\Lambda \approx J^T A J$ or $J \Lambda J^T \approx A$.
Therefore, the columns of $J$ are approximate eigenvectors.


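We will make $J^T A J$ nearly diagonal by iteratively choosing $J_i$ to make one
pair of offdiagonal entries of $A_{i+1} = J_i^T A_i J_i$ zero at a time. We will do
this by choosing $J_i$ to be a Givens rotation,

\[
J_i = R(j, k, \theta) \equiv
\begin{bmatrix}
1 & & & & & & \\
 & \ddots & & & & & \\
 & & \cos\theta & & -\sin\theta & & \\
 & & & \ddots & & & \\
 & & \sin\theta & & \cos\theta & & \\
 & & & & & \ddots & \\
 & & & & & & 1
\end{bmatrix},
\]

with the entries $\cos\theta$ and $\pm\sin\theta$ in rows and columns $j$ and $k$,
where $\theta$ is chosen to zero out the $j,k$ and $k,j$ entries of $A_{i+1}$. To
determine $\theta$ (or actually $\cos\theta$ and $\sin\theta$), write

\[
\begin{bmatrix} a_{jj}^{(i+1)} & a_{jk}^{(i+1)} \\ a_{kj}^{(i+1)} & a_{kk}^{(i+1)} \end{bmatrix}
=
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}^T
\begin{bmatrix} a_{jj}^{(i)} & a_{jk}^{(i)} \\ a_{kj}^{(i)} & a_{kk}^{(i)} \end{bmatrix}
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
=
\begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix},
\]

where $\lambda_1$ and $\lambda_2$ are the eigenvalues of

\[
\begin{bmatrix} a_{jj}^{(i)} & a_{jk}^{(i)} \\ a_{kj}^{(i)} & a_{kk}^{(i)} \end{bmatrix}.
\]

It is easy to compute $\cos\theta$ and $\sin\theta$: Multiplying out the last expression,
using symmetry, abbreviating $c \equiv \cos\theta$ and $s \equiv \sin\theta$, and dropping
the superscript $(i)$ for simplicity yield

\[
\begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}
=
\begin{bmatrix}
a_{jj}c^2 + a_{kk}s^2 + 2sc\,a_{jk} & sc(a_{kk} - a_{jj}) + a_{jk}(c^2 - s^2) \\
sc(a_{kk} - a_{jj}) + a_{jk}(c^2 - s^2) & a_{jj}s^2 + a_{kk}c^2 - 2sc\,a_{jk}
\end{bmatrix}.
\]

Setting the offdiagonals to 0 and solving for $\theta$ we get
$0 = sc(a_{kk} - a_{jj}) + a_{jk}(c^2 - s^2)$, or

\[
\frac{a_{jj} - a_{kk}}{2a_{jk}} = \frac{c^2 - s^2}{2sc} = \frac{\cos 2\theta}{\sin 2\theta} = \cot 2\theta \equiv \tau.
\]

We now let $t = s/c = \tan\theta$ and note that $t^2 + 2\tau t - 1 = 0$ to get (via the
quadratic formula) $t = \frac{\mathrm{sign}(\tau)}{|\tau| + \sqrt{1 + \tau^2}}$,
$c = \frac{1}{\sqrt{1 + t^2}}$, and $s = t \cdot c$. We summarize this
derivation in the following algorithm.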


ALGORITHM 5.5. Compute and apply a Jacobi rotation to A in coordinates
j, k:

    proc Jacobi-Rotation (A, j, k)
        if |a_jk| is not too small
            τ = (a_jj - a_kk)/(2 · a_jk)
            t = sign(τ)/(|τ| + sqrt(1 + τ^2))
            c = 1/sqrt(1 + t^2)
            s = c · t
            A = R^T(j, k, θ) · A · R(j, k, θ) ... where c = cos θ and s = sin θ
            if eigenvectors are desired
                J = J · R(j, k, θ)
            end if
        end if
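A minimal Python sketch of Algorithm 5.5 follows, for a dense symmetric matrix stored as a list of lists. The function name `jacobi_rotation`, the `tiny` threshold argument, and the test matrix are assumptions made for this example, and sign(0) is taken to be +1, a choice the pseudocode leaves open.

```python
import math

def jacobi_rotation(A, j, k, tiny=0.0):
    """Apply one Jacobi rotation in coordinates (j, k) to the symmetric
    matrix A in place and return (c, s)."""
    if abs(A[j][k]) <= tiny:
        return 1.0, 0.0
    tau = (A[j][j] - A[k][k]) / (2.0 * A[j][k])
    sign = 1.0 if tau >= 0 else -1.0          # sign(0) taken as +1
    t = sign / (abs(tau) + math.sqrt(1.0 + tau * tau))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = c * t
    n = len(A)
    # A := A * R(j, k, theta): only columns j and k change.
    for i in range(n):
        aij, aik = A[i][j], A[i][k]
        A[i][j] = c * aij + s * aik
        A[i][k] = -s * aij + c * aik
    # A := R(j, k, theta)^T * A: only rows j and k change.
    for i in range(n):
        aji, aki = A[j][i], A[k][i]
        A[j][i] = c * aji + s * aki
        A[k][i] = -s * aji + c * aki
    return c, s

# Hypothetical 3-by-3 symmetric test matrix.
A = [[4.0, 1.0, 2.0],
     [1.0, 3.0, 0.0],
     [2.0, 0.0, 1.0]]
c, s = jacobi_rotation(A, 0, 1)
```

After the call, the (0,1) and (1,0) entries are zero up to roundoff, and the trace is unchanged because the transformation is an orthogonal similarity.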


The cost of applying $R(j, k, \theta)$ to $A$ (or $J$) is only $O(n)$ flops, because only
rows and columns j and k of A (and columns j and k of J) are modified. The 
overall Jacobi algorithm is then as follows. 


ALGORITHM 5.6. Jacobi’s method to find the eigenvalues of a symmetric ma- 
trix: 


repeat 

choose a j,k pair 

call Jacobi-Rotation(A, j, k) 
until A is sufficiently diagonal 


We still need to decide how to pick j, k pairs. There are several possibilities. 
To measure progress to convergence and describe these possibilities, we define 


\[ \mathrm{off}(A) \equiv \sqrt{\sum_{1 \le j < k \le n} a_{jk}^2}. \]

Thus off$(A)$ is the root-sum-of-squares of the (upper) offdiagonal entries of
$A$, so $A$ is diagonal if and only if off$(A) = 0$. Our goal is to make off$(A)$
approach 0 quickly. The next lemma tells us that off$(A)$ decreases monotonically
with every Jacobi rotation.


LEMMA 5.4. Let $A'$ be the matrix after calling Jacobi-Rotation$(A, j, k)$ for any
$j \ne k$. Then off$^2(A')$ = off$^2(A) - a_{jk}^2$.

Proof. Note that $A' = A$ except in rows and columns $j$ and $k$. Write

\[
\mathrm{off}^2(A) = \Bigg( \sum_{\substack{1 \le j' < k' \le n \\ j' \ne j \text{ or } k' \ne k}} a_{j'k'}^2 \Bigg) + a_{jk}^2 \equiv S^2 + a_{jk}^2
\]

and similarly off$^2(A') = S'^2 + a'^2_{jk} = S'^2$, since $a'_{jk} = 0$ after calling
Jacobi-Rotation$(A, j, k)$. Since $\|X\|_F = \|QX\|_F$ and $\|X\|_F = \|XQ\|_F$ for any
$X$ and any orthogonal $Q$, we can show $S^2 = S'^2$. Thus
off$^2(A')$ = off$^2(A) - a_{jk}^2$ as desired.

The next algorithm was the original version of the algorithm (from Jacobi 
in 1846), and it has an attractive analysis although it is too slow to use. 


ALGORITHM 5.7. Classical Jacobi’s algorithm: 


    while off(A) > tol (where tol is the stopping criterion set by user)
        choose j and k so a_jk is the largest offdiagonal entry in magnitude
        call Jacobi-Rotation (A, j, k)
    end while


THEOREM 5.11. After one Jacobi rotation in the classical Jacobi's algorithm,
we have off$(A') \le \sqrt{1 - \frac{1}{N}}\,$off$(A)$, where $N = \frac{n(n-1)}{2}$ = the
number of superdiagonal entries of $A$. After $k$ Jacobi-Rotations off$(\cdot)$ is no
more than $(1 - \frac{1}{N})^{k/2}$ off$(A)$.

Proof. By Lemma 5.4, after one step, off$^2(A')$ = off$^2(A) - a_{jk}^2$, where $a_{jk}$
is the largest offdiagonal entry. Thus off$^2(A) \le \frac{n(n-1)}{2} a_{jk}^2$, or
$a_{jk}^2 \ge \frac{1}{N}$ off$^2(A)$, so
off$^2(A) - a_{jk}^2 \le (1 - \frac{1}{N})$ off$^2(A)$ as desired.

So the classical Jacobi's algorithm converges at least linearly with the error
(measured by off$(A)$) decreasing by a factor of at least $\sqrt{1 - \frac{1}{N}}$ at a
time. In fact, it eventually converges quadratically.


THEOREM 5.12. Jacobi's method is locally quadratically convergent after $N$
steps (i.e., enough steps to choose each $a_{jk}$ once). This means that for $i$ large
enough

\[ \mathrm{off}(A_{i+N}) = O(\mathrm{off}^2(A_i)). \]

In practice, we do not use the classical Jacobi's algorithm because searching
for the largest entry is too slow: We would need to search $\frac{n^2 - n}{2}$ entries
for every Jacobi rotation, which costs only $O(n)$ flops to perform, and so for large
$n$ the search time would dominate. Instead, we use the following simple method to
choose $j$ and $k$.


ALGORITHM 5.8. Cyclic-by-row-Jacobi: Sweep through the offdiagonals of A 
rowwise. 


    repeat
        for j = 1 to n - 1
            for k = j + 1 to n
                call Jacobi-Rotation(A, j, k)
            end for
        end for
    until A is sufficiently diagonal


A no longer changes when Jacobi-Rotation(A, j, k) chooses only c = 1 and 
s = 0 for an entire pass through the inner loop. The cyclic Jacobi’s algo- 
rithm is also asymptotically quadratically convergent like the classical Jacobi’s 
algorithm [260, p. 270]. 
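To make the cyclic-by-row strategy concrete, here is a small self-contained Python sketch that sweeps until off(A) falls below a tolerance. The names `off`, `rotate`, and `cyclic_jacobi`, the tolerance, the sweep limit, and the test matrix are assumptions made for illustration. Two details go beyond Algorithm 5.5: the annihilated entry is explicitly set to zero to remove its rounding residue, and `math.copysign` picks the sign of t (either root with |t| = 1 is valid when τ = 0).

```python
import math

def off(A):
    """Root-sum-of-squares of the strictly upper triangular entries of A."""
    n = len(A)
    return math.sqrt(sum(A[j][k] ** 2 for j in range(n) for k in range(j + 1, n)))

def rotate(A, j, k):
    """One Jacobi rotation in coordinates (j, k), applied to A in place."""
    if A[j][k] == 0.0:
        return
    tau = (A[j][j] - A[k][k]) / (2.0 * A[j][k])
    t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = c * t
    for i in range(len(A)):            # columns j and k:  A := A R
        aij, aik = A[i][j], A[i][k]
        A[i][j], A[i][k] = c * aij + s * aik, -s * aij + c * aik
    for i in range(len(A)):            # rows j and k:  A := R^T A
        aji, aki = A[j][i], A[k][i]
        A[j][i], A[k][i] = c * aji + s * aki, -s * aji + c * aki
    A[j][k] = A[k][j] = 0.0            # remove the rounding residue

def cyclic_jacobi(A, tol=1e-12, max_sweeps=30):
    """Return (sorted approximate eigenvalues, off(A) after each sweep)."""
    A = [row[:] for row in A]          # work on a copy
    n = len(A)
    history = [off(A)]
    while history[-1] > tol and len(history) <= max_sweeps:
        for j in range(n - 1):         # one sweep, rowwise
            for k in range(j + 1, n):
                rotate(A, j, k)
        history.append(off(A))
    return sorted(A[i][i] for i in range(n)), history

# Hypothetical test matrix with eigenvalues 2 - sqrt(2), 2, 2 + sqrt(2).
T = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
eigenvalues, history = cyclic_jacobi(T)
```

The list `history` records off(A) after each sweep, so printing it is a simple way to watch the decrease promised by Lemma 5.4 and the rapid final convergence.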

The cost of one Jacobi “sweep” (where each j,k pair is selected once) is 
approximately half the cost of reduction to tridiagonal form and the compu- 
tation of eigenvalues and eigenvectors using QR iteration, and more than the 
cost using divide-and-conquer. Since Jacobi’s method often takes 5-10 sweeps 
to converge, it is much slower than the competition. 


5.3.6. Performance Comparison 


In this section we analyze the performance of the three fastest algorithms 
for the symmetric eigenproblem: QR iteration, Bisection with inverse itera- 
tion, and divide-and-conquer. More details may be found in [10, chap. 3] or 
NETLIB/lapack/lug/lapack_lug.html. 

We begin by discussing the fastest algorithm and later compare the others. 
We used the LAPACK routine ssyevd. The algorithm to find only eigenvalues
is reduction to tridiagonal form followed by QR iteration, for an operation
count of $\frac{4}{3}n^3 + O(n^2)$ flops. The algorithm to find eigenvalues and eigenvectors
is tridiagonal reduction followed by divide-and-conquer. We timed ssyevd on 
an IBM RS6000/590, a workstation with a peak speed of 266 Mflops, although 
optimized matrix-multiplication runs at only 233 Mflops for 100-by-100 matri- 
ces and 256 Mflops for 1000-by-1000 matrices. The actual performance is given 
in the table below. The “Mflop rate” is the actual speed of the code in Mflops, 
and “Time / Time(Matmul)” is the time to solve the eigenproblem divided by 
the time to multiply two square matrices of the same size. We see that for large 
enough matrices, matrix-multiplication and finding only the eigenvalues of a 
symmetric matrix are about equally expensive. (In contrast, the nonsymmetric
eigenproblem is at least 16 times more costly [10].) Finding the eigenvectors
as well is a little under three times as expensive as matrix-multiplication.


Dimension   Eigenvalues only                   Eigenvalues and eigenvectors
            Mflop rate   Time/Time(Matmul)     Mflop rate   Time/Time(Matmul)
   100          72             3.1                 72             9.3
  1000         160             1.1                174             2.8


Now we compare the relative performance of QR iteration, Bisection with
inverse iteration, and divide-and-conquer. In Figures 5.4 and 5.5 these are
labeled QR, BZ (for the LAPACK routine sstebz, which implements Bisection),
and DC, respectively. The horizontal axis in these plots is matrix dimension, 
and the vertical axis is time divided by the time for DC. Therefore, the DC 
curve is a horizontal line at 1, and the other curves measure how many times 
slower BZ and QR are than DC. Figure 5.4 shows only the time for the tridi- 
agonal eigenproblem, whereas Figure 5.5 shows the entire time, starting from 
a dense matrix. 

In the top graph in Figure 5.5 the matrices tested were random symmetric 
matrices; in Figure 5.4, the tridiagonal matrices were obtained by reducing 
these dense matrices to tridiagonal form. Such random matrices have well- 
separated eigenvalues on average, so inverse iteration requires little or no ex- 
pensive reorthogonalization. Therefore BZ was comparable in performance 
to DC, although QR was significantly slower, up to 15 times slower in the 
tridiagonal phase on large matrices. 

In the bottom two graphs, the dense symmetric matrices had eigenvalues
$1, .5, .25, \ldots, .5^{n-1}$. In other words, there were many eigenvalues clustered
near zero, so inverse iteration had a lot of reorthogonalization to do. Thus 
the tridiagonal part of BZ was over 70 times slower than DC. QR was up to 
54 times slower than DC, too, because DC actually speeds up when there is a 
large cluster of eigenvalues; this is because of deflation. 

The distinction in speeds among QR, BZ, and DC is less noticeable in
Figure 5.5 than in Figure 5.4, because Figure 5.5 includes the common $O(n^3)$
overhead of reduction to tridiagonal form and transforming the eigenvectors of
the tridiagonal matrix to eigenvectors of the original dense matrix; this common
overhead is labeled TRD. Since DC is so close to TRD in Figure 5.5, this means 
that any further acceleration of DC will make little difference in the overall 
speed of the dense algorithm. 


5.4. Algorithms for the Singular Value Decomposition 


In Theorem 3.3, we showed that the SVD of the general matrix $G$ is closely
related to the eigendecompositions of the symmetric matrices $G^TG$, $GG^T$, and
$\begin{bmatrix} 0 & G^T \\ G & 0 \end{bmatrix}$. Using these facts, the algorithms in
the previous section can be transformed into algorithms for the SVD. The
transformations are not straightforward, however, because the added structure of
the SVD can often be exploited to make the algorithms more efficient or more
accurate [118, 79, 66].

All the algorithms for the eigendecomposition of a symmetric matrix $A$,
except Jacobi's method, have the following structure:


1. Reduce $A$ to tridiagonal form $T$ with an orthogonal matrix $Q_1$:
   $A = Q_1 T Q_1^T$.


2. Find the eigendecomposition of $T$: $T = Q_2 \Lambda Q_2^T$, where $\Lambda$ is the
   diagonal matrix of eigenvalues and $Q_2$ is the orthogonal matrix whose columns
   are eigenvectors.


3. Combine these decompositions to get $A = (Q_1Q_2)\Lambda(Q_1Q_2)^T$. The columns
   of $Q = Q_1Q_2$ are the eigenvectors of $A$.


Fig. 5.4. Speed of finding eigenvalues and eigenvectors of a symmetric tridiagonal
matrix, relative to divide-and-conquer.

Fig. 5.5. Speed of finding eigenvalues and eigenvectors of a symmetric dense matrix,
relative to divide-and-conquer.


All the algorithms for the SVD of a general matrix G, except Jacobi’s method, 
have an analogous structure: 


1. Reduce $G$ to bidiagonal form $B$ with orthogonal matrices $U_1$ and $V_1$:
   $G = U_1 B V_1^T$. This means $B$ is nonzero only on the main diagonal and
   first superdiagonal.


2. Find the SVD of $B$: $B = U_2 \Sigma V_2^T$, where $\Sigma$ is the diagonal matrix of
   singular values, and $U_2$ and $V_2$ are orthogonal matrices whose columns
   are the left and right singular vectors, respectively.


3. Combine these decompositions to get $G = (U_1U_2)\Sigma(V_1V_2)^T$. The columns
   of $U = U_1U_2$ and $V = V_1V_2$ are the left and right singular vectors of $G$,
   respectively.


Reduction to bidiagonal form is accomplished by the algorithm in section 4.4.7.
Recall from the discussion there that it costs $\frac{8}{3}n^3 + O(n^2)$ flops to compute
$B$; this is all that is needed if only the singular values $\Sigma$ are to be computed.
It costs another $4n^3 + O(n^2)$ flops to compute $U_1$ and $V_1$, which are needed to
compute the singular vectors as well.

The following simple lemma shows how to convert the problem of finding 
the SVD of the bidiagonal matrix B into the eigendecomposition of a symmetric 
tridiagonal matrix T. 


LEMMA 5.5. Let $B$ be an $n$-by-$n$ bidiagonal matrix, with diagonal $a_1, \ldots, a_n$
and superdiagonal $b_1, \ldots, b_{n-1}$. There are three ways to convert the problem of
finding the SVD of $B$ to finding the eigenvalues and eigenvectors of a symmetric
tridiagonal matrix.


1. Let $A = \begin{bmatrix} 0 & B^T \\ B & 0 \end{bmatrix}$. Let $P$ be the permutation
   matrix $P = [e_1, e_{n+1}, e_2, e_{n+2}, \ldots, e_n, e_{2n}]$, where $e_i$ is the $i$th
   column of the $2n$-by-$2n$ identity matrix. Then $T_{ps} \equiv P^T A P$ is symmetric
   tridiagonal. The subscript “ps” stands for perfect shuffle, because multiplying
   $P$ times a vector $x$ “shuffles” the entries of $x$ like a deck of cards. One can
   show that $T_{ps}$ has all zeros on its main diagonal, and its superdiagonal and
   subdiagonal is $a_1, b_1, a_2, b_2, \ldots, b_{n-1}, a_n$. If $T_{ps} x_i = \alpha_i x_i$
   is an eigenpair for $T_{ps}$, with $x_i$ a unit vector, then $\alpha_i = \pm\sigma_i$,
   where $\sigma_i$ is a singular value of $B$, and
   $P x_i = \frac{1}{\sqrt{2}} \begin{bmatrix} v_i \\ \pm u_i \end{bmatrix}$, where $u_i$
   and $v_i$ are left and right singular vectors of $B$, respectively.


2. Let $T_{BB^T} \equiv BB^T$. Then $T_{BB^T}$ is symmetric tridiagonal with diagonal
   $a_1^2 + b_1^2, a_2^2 + b_2^2, \ldots, a_{n-1}^2 + b_{n-1}^2, a_n^2$, and superdiagonal
   and subdiagonal $a_2b_1, a_3b_2, \ldots, a_nb_{n-1}$. The singular values of $B$ are
   the square roots of the eigenvalues of $T_{BB^T}$, and the left singular vectors of
   $B$ are the eigenvectors of $T_{BB^T}$. $T_{BB^T}$ contains no information about the
   right singular vectors of $B$.


3. Let $T_{B^TB} \equiv B^TB$. Then $T_{B^TB}$ is symmetric tridiagonal with diagonal
   $a_1^2, a_2^2 + b_1^2, a_3^2 + b_2^2, \ldots, a_n^2 + b_{n-1}^2$, and superdiagonal and
   subdiagonal $a_1b_1, a_2b_2, \ldots, a_{n-1}b_{n-1}$. The singular values of $B$ are
   the square roots of the eigenvalues of $T_{B^TB}$, and the right singular vectors of
   $B$ are the eigenvectors of $T_{B^TB}$. $T_{B^TB}$ contains no information about the
   left singular vectors of $B$.


For a proof, see Question 5.19. 
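Part 1 of the lemma can be checked numerically on a tiny example. The sketch below builds the block matrix with blocks $B^T$ and $B$, applies the perfect-shuffle permutation, and returns the result; the function name `perfect_shuffle_tridiagonal` and the sample entries are hypothetical, and forming the 2n-by-2n matrix explicitly is done only for illustration.

```python
def perfect_shuffle_tridiagonal(a, b):
    """Form T_ps = P^T A P for the bidiagonal matrix B with diagonal a
    and superdiagonal b, where A = [[0, B^T], [B, 0]]."""
    n = len(a)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        B[i][i] = a[i]
        if i < n - 1:
            B[i][i + 1] = b[i]
    A = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            A[i][n + j] = B[j][i]      # top-right block: B^T
            A[n + i][j] = B[i][j]      # bottom-left block: B
    # Columns of P are e_1, e_{n+1}, e_2, e_{n+2}, ... (0-based here).
    perm = []
    for i in range(n):
        perm += [i, n + i]
    # (P^T A P)[r][c] = A[perm[r]][perm[c]]
    return [[A[perm[r]][perm[c]] for c in range(2 * n)] for r in range(2 * n)]

# Hypothetical entries: diagonal (1, 2, 3), superdiagonal (4, 5).
T_ps = perfect_shuffle_tridiagonal([1.0, 2.0, 3.0], [4.0, 5.0])
offdiag = [T_ps[i][i + 1] for i in range(5)]
```

For these inputs the superdiagonal of `T_ps` interleaves the diagonal and superdiagonal of $B$, the main diagonal is zero, and every entry outside the three central diagonals vanishes, as the lemma states.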

Thus, we could in principle apply any of QR iteration, divide-and-conquer,
or Bisection with inverse iteration to one of the tridiagonal matrices from
Lemma 5.5 and then extract the singular values and (perhaps only left or right)
singular vectors from the resulting eigendecomposition. However, this simple
approach would sacrifice both speed and accuracy by ignoring the special
properties of the underlying SVD problem. We give two illustrations of this.

First, it would be inefficient to run symmetric tridiagonal QR iteration or 
divide-and-conquer on Tps. This is because these algorithms both compute all 
the eigenvalues (and perhaps eigenvectors) of Tps, whereas Lemma 5.5 tells us 
we only need the nonnegative eigenvalues (and perhaps eigenvectors). There 
are some accuracy difficulties with singular vectors for tiny singular values too. 

Second, explicitly forming either $T_{BB^T}$ or $T_{B^TB}$ is numerically unstable.
In fact one can lose half the accuracy in the small singular values of $B$. For
example, let $\eta = \varepsilon/2$, so $1 + \eta$ rounds to 1 in floating point
arithmetic. Let $B = \begin{bmatrix} 1 & 1 \\ 0 & \sqrt{\eta} \end{bmatrix}$, which has
singular values near $\sqrt{2}$ and $\sqrt{\eta/2}$. Then
$B^TB = \begin{bmatrix} 1 & 1 \\ 1 & 1 + \eta \end{bmatrix}$ rounds to
$T_{B^TB} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$, an exactly singular matrix.
Thus, rounding $1 + \eta$ to 1 changes the smaller computed singular value from its true
value near $\sqrt{\eta/2} = \sqrt{\varepsilon}/2$ to 0. In contrast, a backward stable
algorithm should change the singular values by no more than
$O(\varepsilon)\|B\|_2 = O(\varepsilon)$. In IEEE double precision floating point
arithmetic, $\varepsilon \approx 10^{-16}$ and $\sqrt{\varepsilon}/2 \approx 10^{-8}$, so
the error introduced by forming $B^TB$ is $10^8$ times larger than roundoff, a much
larger change. The same loss of accuracy can occur by explicitly forming $T_{BB^T}$.
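This example can be reproduced directly in IEEE double precision. In the sketch below, $\varepsilon$ is assumed to be the unit roundoff $2^{-53}$, so $\eta = 2^{-54}$; the helper `smallest_eig_sym2` is a hypothetical name, and the reference value for the smallest singular value is obtained from $|\det B| / \sigma_{\max}$ rather than from any library SVD routine.

```python
import math

eta = 2.0 ** -54            # half of the unit roundoff 2**-53
r = math.sqrt(eta)          # exactly 2**-27
B = [[1.0, 1.0],
     [0.0, r]]

# Form B^T B in floating point; the (2,2) entry 1 + eta rounds to 1.
G = [[sum(B[k][i] * B[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

def smallest_eig_sym2(M):
    """Smaller eigenvalue of a symmetric 2x2 matrix (quadratic formula)."""
    half_tr = (M[0][0] + M[1][1]) / 2.0
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return half_tr - math.sqrt(half_tr * half_tr - det)

# Smallest singular value recovered from the explicitly formed B^T B.
sigma_from_gram = math.sqrt(max(0.0, smallest_eig_sym2(G)))

# Reference value working with B itself: sigma_min = |det B| / sigma_max,
# where sigma_max is very close to sqrt(2).
sigma_max = math.sqrt(G[0][0] + G[1][1])
sigma_true = abs(B[0][0] * B[1][1] - B[0][1] * B[1][0]) / sigma_max
```

The value recovered from the explicitly formed $B^TB$ is exactly 0, while the value computed from $B$ itself is about $5.3 \times 10^{-9}$, in line with the estimate $\sqrt{\varepsilon}/2$ above.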

Because of the instability caused by computing $T_{BB^T}$ or $T_{B^TB}$, good SVD
algorithms work directly on $B$ or possibly $T_{ps}$.

In summary, we describe the practical algorithms used for computing the 
SVD. 


1. QR iteration and its variations. Properly implemented [102], this is the
   fastest algorithm for finding all the singular values of a bidiagonal matrix.
   Furthermore, it finds all the singular values to high relative accuracy,
   as discussed in section 5.2.1. This means that all the digits of all the
   singular values are correct, even the tiniest ones. In contrast, symmetric
   tridiagonal QR iteration may compute tiny eigenvalues with no relative
   accuracy at all. A different variation of QR iteration [79] is used to
   compute the singular vectors as well: by using QR iteration with a zero
   shift to compute the smallest singular vectors, this variation computes
   the singular values nearly as accurately, as well as getting singular vectors
   as accurately as described in section 5.2.1. But this is only the fastest
   algorithm for small matrices, up to about dimension $n = 25$. This routine
   is available in LAPACK subroutine sbdsqr.


2. Divide-and-conquer. This is currently the fastest method to find all
   singular values and singular vectors for matrices larger than $n = 25$. (The
   implementation in LAPACK, sbdsdc, defaults to sbdsqr for small matrices.)
   However, divide-and-conquer does not guarantee that the tiny
   singular values are computed to high relative accuracy. Instead, it
   guarantees only the same error bound as in the symmetric eigenproblem: the
   error in singular value $\sigma_j$ is at most $O(\varepsilon)\sigma_1$ rather than
   $O(\varepsilon)\sigma_j$. This is sufficiently accurate for most applications.


3. Bisection and inverse iteration. One can apply Bisection and inverse
   iteration to $T_{ps}$ of part 1 of Lemma 5.5 to find only the singular values in
   a desired interval. This algorithm is guaranteed to find the singular values
   to high relative accuracy, although the singular vectors may occasionally
   suffer loss of orthogonality as described in section 5.3.4.


4. Jacobi's method. We may compute the SVD of a dense matrix $G$ by
   applying Jacobi's method of section 5.3.5 implicitly to $GG^T$ or $G^TG$,
   i.e., without explicitly forming either one and so possibly losing stability.
   For some classes of $G$, i.e., those to which we can profitably apply the
   relative perturbation theory of section 5.2.1, we can show that Jacobi's
   method computes the singular values and singular vectors to high relative
   accuracy, as described in section 5.2.1.


The following sections describe some of the above algorithms in more detail,
notably QR iteration and its variation dqds in section 5.4.1; the proof
of high accuracy of dqds and Bisection in section 5.4.2; and Jacobi's method
in section 5.4.3. We omit divide-and-conquer because of its overall similarity
to the algorithm discussed in section 5.3.3, and refer the reader to [128] for
details.


5.4.1. QR Iteration and Its Variations for the Bidiagonal SVD 


There is a long history of variations on QR iteration for the SVD, designed
to be as efficient and accurate as possible; see [198] for a good survey. The
algorithm in the LAPACK routine sbdsqr was originally based on [79] and
later updated to use the algorithm in [102] in the case when singular values
only are desired. This latter algorithm, called dqds for historical reasons,$^{21}$ is
elegant, fast, and accurate, so we will present it.

To derive dqds, we begin with an algorithm that predates QR iteration, 
called LR iteration, specialized to symmetric positive definite matrices. 


ALGORITHM 5.9. LR Iteration: Let $T_0$ be any symmetric positive definite matrix.
The following algorithm produces a sequence of similar symmetric positive
definite matrices $T_i$:

    i = 0
    repeat
        Choose a shift $\tau_i^2$ smaller than the smallest eigenvalue of $T_i$.
        Compute the Cholesky factorization $T_i - \tau_i^2 I = B_i^T B_i$
            ($B_i$ is an upper triangular matrix with positive diagonal.)
        $T_{i+1} = B_i B_i^T + \tau_i^2 I$
        i = i + 1
    until convergence

LR iteration is very similar in structure to QR iteration: We compute a
factorization, and multiply the factors in reverse order to get the next iterate
$T_{i+1}$. It is easy to see that $T_{i+1}$ and $T_i$ are similar:
$T_{i+1} = B_i B_i^T + \tau_i^2 I = B_i^{-T} B_i^T B_i B_i^T + \tau_i^2 B_i^{-T} B_i^T
= B_i^{-T} T_i B_i^T$.

In fact, when the shift $\tau_i^2 = 0$, we can show that two steps of LR iteration
produce the same $T_2$ as one step of QR iteration.


LEMMA 5.6. Let Ty be the matrix produced by two steps of Algorithm 5.9 using 
T? = 0, and let T' be the matrix produced by one step of QR iteration (QR = To, 
T' = RQ). Then To = T. 


Proof. Since $T_0$ is symmetric, we can factorize $T_0^2$ in two ways: First,
$T_0^2 = T_0^T T_0 = (QR)^T QR = R^T R$. We assume without loss of generality that
$R_{ii} > 0$. This is a factorization of $T_0^2$ into a lower triangular matrix $R^T$
times its transpose; since the Cholesky factorization is unique, this must in
fact be the Cholesky factorization. The second factorization is
$T_0^2 = B_0^T B_0 B_0^T B_0$. Now by Algorithm 5.9, $T_1 = B_0 B_0^T = B_1^T B_1$, so we
can rewrite $T_0^2 = B_0^T B_0 B_0^T B_0 = B_0^T (B_1^T B_1) B_0 = (B_1 B_0)^T B_1 B_0$. This is
also a factorization of $T_0^2$ into a lower triangular matrix $(B_1 B_0)^T$ times its
transpose, so this must again be the Cholesky factorization. By uniqueness of
the Cholesky factorization, we conclude $R = B_1 B_0$, thus relating two steps of
LR iteration to one step of QR iteration. We exploit this relationship as
follows: $T_0 = QR$ implies

$$\begin{aligned}
T' = RQ &= RQ(RR^{-1}) = R(QR)R^{-1} = R T_0 R^{-1} && \text{because } T_0 = QR \\
&= (B_1 B_0)(B_0^T B_0)(B_1 B_0)^{-1} && \text{because } R = B_1 B_0 \text{ and } T_0 = B_0^T B_0 \\
&= B_1 B_0 B_0^T B_0 B_0^{-1} B_1^{-1} = B_1 (B_0 B_0^T) B_1^{-1} \\
&= B_1 (B_1^T B_1) B_1^{-1} && \text{because } B_0 B_0^T = T_1 = B_1^T B_1 \\
&= B_1 B_1^T \\
&= T_2 && \text{as desired.}
\end{aligned}$$


²¹ dqds is short for “differential quotient-difference algorithm with shifts” [207].


Neither Algorithm 5.9 nor Lemma 5.6 depends on $T_0$ being tridiagonal,
just symmetric positive definite. Using the relationship between LR iteration
and QR iteration in Lemma 5.6, one can show that much of the convergence
analysis of QR iteration goes over to LR iteration; we will not explore this
here.

Our ultimate algorithm, dqds, is mathematically equivalent to LR iteration.
But it is not implemented as described in Algorithm 5.9, because this would
involve explicitly forming $T_{i+1} = B_i B_i^T + \tau_i^2 I$, which in section 5.4 we
showed could be numerically unstable. Instead, we will form $B_{i+1}$ directly
from $B_i$, without ever forming the intermediate matrix $T_{i+1}$.

To simplify notation, let $B_i$ have diagonal $a_1, \ldots, a_n$ and superdiagonal
$b_1, \ldots, b_{n-1}$, and $B_{i+1}$ have diagonal $\hat a_1, \ldots, \hat a_n$ and superdiagonal
$\hat b_1, \ldots, \hat b_{n-1}$. We use the convention $b_0 = \hat b_0 = b_n = \hat b_n = 0$. We relate
$B_i$ to $B_{i+1}$ by


$$B_{i+1}^T B_{i+1} + \tau_{i+1}^2 I = T_{i+1} = B_i B_i^T + \tau_i^2 I. \tag{5.20}$$


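Equating the $j,j$ entries of the left and right sides of equation (5.20) for
$j < n$ yields


$$\hat a_j^2 + \hat b_{j-1}^2 + \tau_{i+1}^2 = a_j^2 + b_j^2 + \tau_i^2 \quad\text{or}\quad \hat a_j^2 = a_j^2 + b_j^2 - \hat b_{j-1}^2 - \delta, \tag{5.21}$$


where $\delta = \tau_{i+1}^2 - \tau_i^2$. Since $\tau_i^2$ must be chosen to approach the smallest
eigenvalue of $T$ from below (to keep $T_i$ positive definite and the algorithm
well defined), $\delta \ge 0$. Equating the squares of the $j,j+1$ entries of the left
and right sides of equation (5.20) yields


$$\hat a_j^2 \hat b_j^2 = a_{j+1}^2 b_j^2 \quad\text{or}\quad \hat b_j^2 = a_{j+1}^2 b_j^2 / \hat a_j^2. \tag{5.22}$$

Combining equations (5.21) and (5.22) yields the not-yet-final algorithm


    for j = 1 to n-1
        $\hat a_j^2 = a_j^2 + b_j^2 - \hat b_{j-1}^2 - \delta$
        $\hat b_j^2 = b_j^2 \cdot (a_{j+1}^2 / \hat a_j^2)$
    end for
    $\hat a_n^2 = a_n^2 - \hat b_{n-1}^2 - \delta$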

This version of the algorithm has only five floating point operations in the
inner loop, which is quite inexpensive. It maps directly from the squares of the
entries of $B_i$ to the squares of the entries of $B_{i+1}$. There is no reason to take
square roots until the very end of the algorithm. Indeed, square roots, along
with divisions, can take 10 to 30 times longer than additions, subtractions, or
multiplications on modern computers, so we should avoid as many of them as
possible. To emphasize that we are computing squares of entries, we change
variables to $q_j = a_j^2$ and $e_j = b_j^2$, yielding the penultimate algorithm qds
(again, the name is for historical reasons that do not concern us [207]).


ALGORITHM 5.10. One step of the qds algorithm:


    for j = 1 to n-1
        $\hat q_j = q_j + e_j - \hat e_{j-1} - \delta$
        $\hat e_j = e_j \cdot (q_{j+1} / \hat q_j)$
    end for
    $\hat q_n = q_n - \hat e_{n-1} - \delta$


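The final algorithm, dqds, will do about the same amount of work as qds
but will be significantly more accurate, as will be shown in section 5.4.2. We
take the subexpression $q_j - \hat e_{j-1} - \delta$ from the first line of Algorithm 5.10
and rewrite it as follows:


$$\begin{aligned}
d_j &\equiv q_j - \hat e_{j-1} - \delta \\
&= q_j - \frac{q_j e_{j-1}}{\hat q_{j-1}} - \delta && \text{from (5.22)} \\
&= q_j \left[ \frac{\hat q_{j-1} - e_{j-1}}{\hat q_{j-1}} \right] - \delta \\
&= q_j \left[ \frac{q_{j-1} - \hat e_{j-2} - \delta}{\hat q_{j-1}} \right] - \delta && \text{from (5.21)} \\
&= \frac{q_j}{\hat q_{j-1}} d_{j-1} - \delta.
\end{aligned}$$


This lets us rewrite the inner loop of Algorithm 5.10 as


    $\hat q_j = d_j + e_j$
    $\hat e_j = e_j \cdot (q_{j+1} / \hat q_j)$
    $d_{j+1} = d_j \cdot (q_{j+1} / \hat q_j) - \delta$

Finally, we note that $d_{j+1}$ can overwrite $d_j$ and that $t = q_{j+1}/\hat q_j$ need be
computed only once to get the final dqds algorithm.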


ALGORITHM 5.11. One step of the dqds algorithm:


    $d = q_1 - \delta$
    for j = 1 to n-1
        $\hat q_j = d + e_j$
        $t = (q_{j+1} / \hat q_j)$
        $\hat e_j = e_j \cdot t$
        $d = d \cdot t - \delta$
    end for
    $\hat q_n = d$


The dqds algorithm has the same number of floating point operations in its
inner loop as qds but trades a subtraction for a multiplication. This
modification pays off handsomely in guaranteed high relative accuracy, as
described in the next section.

There are two important issues we have not discussed: choosing a shift
$\delta = \tau_{i+1}^2 - \tau_i^2$ and detecting convergence. These are discussed in detail
in [102].
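
The two loops of Algorithms 5.10 and 5.11 can be sketched in a few lines of Python. The function names `qds_step` and `dqds_step`, the 0-based list storage, and the sample values of q, e, and the shift are assumptions for illustration only; shift selection and convergence tests are deliberately left out.

```python
def qds_step(q, e, delta=0.0):
    """One step of Algorithm 5.10 on q_j = a_j^2 (length n), e_j = b_j^2 (length n-1)."""
    n = len(q)
    q_hat, e_hat = [0.0] * n, [0.0] * (n - 1)
    e_prev = 0.0                          # convention: e_hat_0 = 0
    for j in range(n - 1):
        q_hat[j] = q[j] + e[j] - e_prev - delta
        e_hat[j] = e[j] * (q[j + 1] / q_hat[j])
        e_prev = e_hat[j]
    q_hat[n - 1] = q[n - 1] - e_prev - delta
    return q_hat, e_hat

def dqds_step(q, e, delta=0.0):
    """One step of Algorithm 5.11; same inputs and outputs as qds_step."""
    n = len(q)
    q_hat, e_hat = [0.0] * n, [0.0] * (n - 1)
    d = q[0] - delta
    for j in range(n - 1):
        q_hat[j] = d + e[j]
        t = q[j + 1] / q_hat[j]           # computed once, used twice
        e_hat[j] = e[j] * t
        d = d * t - delta
    q_hat[n - 1] = d
    return q_hat, e_hat

# Illustrative data: B has diagonal (4, 3, 2, 1) and superdiagonal (1, 1, 1).
q, e, delta = [16.0, 9.0, 4.0, 1.0], [1.0, 1.0, 1.0], 0.1
q1, e1 = qds_step(q, e, delta)
q2, e2 = dqds_step(q, e, delta)
```

In exact arithmetic the two functions return the same values. Equation (5.20) also implies that the sum of all q and e entries drops by exactly n times the shift in one step, which gives a cheap sanity check.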


5.4.2. Computing the Bidiagonal SVD to High Relative Accuracy 


This section, which depends on section 5.2.1, may be skipped on a first reading. 


Our ability to compute the SVD of a bidiagonal matrix B to high relative 
accuracy (as defined in section 5.2.1) depends on Theorem 5.13 below, which 
says that small relative changes in the entries of B cause only small relative 
changes in the singular values. 


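LEMMA 5.7. Let $B$ be a bidiagonal matrix, with diagonal entries $a_1, \ldots, a_n$
and superdiagonal entries $b_1, \ldots, b_{n-1}$. Let $\hat B$ be another bidiagonal matrix
with diagonal entries $\hat a_i = a_i \chi_i$ and superdiagonal entries $\hat b_i = b_i \zeta_i$. Then
$\hat B = D_1 B D_2$, where


$$D_1 = \operatorname{diag}\left( \chi_1,\ \frac{\chi_2 \chi_1}{\zeta_1},\ \frac{\chi_3 \chi_2 \chi_1}{\zeta_2 \zeta_1},\ \ldots,\ \frac{\chi_n \cdots \chi_1}{\zeta_{n-1} \cdots \zeta_1} \right),$$

$$D_2 = \operatorname{diag}\left( 1,\ \frac{\zeta_1}{\chi_1},\ \frac{\zeta_2 \zeta_1}{\chi_2 \chi_1},\ \ldots,\ \frac{\zeta_{n-1} \cdots \zeta_1}{\chi_{n-1} \cdots \chi_1} \right).$$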
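
The scaling in Lemma 5.7 is easy to check numerically. The sketch below builds the diagonals of $D_1$ and $D_2$ from their recurrences and forms the entries of $D_1 B D_2$; the function name `scaling_diagonals` and all sample values are assumptions chosen for illustration.

```python
def scaling_diagonals(chi, zeta):
    """Diagonals of D1 and D2 in Lemma 5.7 (chi has length n, zeta length n-1)."""
    d1, d2 = [chi[0]], [1.0]
    for i in range(1, len(chi)):
        d1.append(d1[i - 1] * chi[i] / zeta[i - 1])
        d2.append(d2[i - 1] * zeta[i - 1] / chi[i - 1])
    return d1, d2

# Illustrative bidiagonal entries and relative perturbation factors.
a, b = [4.0, 3.0, 2.0, 1.0], [0.5, 0.25, 0.125]
chi, zeta = [1.01, 0.99, 1.02, 0.98], [1.03, 0.97, 1.005]
d1, d2 = scaling_diagonals(chi, zeta)

# Entries of D1 * B * D2 on the diagonal and on the superdiagonal.
diag_scaled = [d1[i] * a[i] * d2[i] for i in range(len(a))]
super_scaled = [d1[i] * b[i] * d2[i + 1] for i in range(len(b))]
```

The lists `diag_scaled` and `super_scaled` should agree, up to roundoff, with the entries $a_i \chi_i$ and $b_i \zeta_i$ of the perturbed matrix.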


The proof of this lemma is a simple computation (see Question 5.20). We 
can now apply Corollary 5.2 to conclude the following. 


THEOREM 5.13. Let $B$ and $\hat B$ be defined as in Lemma 5.7. Suppose that there
is a $\tau \ge 1$ such that $\tau^{-1} \le \chi_i \le \tau$ and $\tau^{-1} \le \zeta_i \le \tau$. In other words
$\varepsilon \equiv \tau - 1$ is a bound on the relative difference between each entry of $B$
and the corresponding entry of $\hat B$. Let $\sigma_n \le \cdots \le \sigma_1$ be the singular values of
$B$ and $\hat\sigma_n \le \cdots \le \hat\sigma_1$ be the singular values of $\hat B$. Then
$|\hat\sigma_i - \sigma_i| \le \sigma_i (\tau^{4n-2} - 1)$. If $\sigma_i \ne 0$ and $\tau - 1 = \varepsilon \ll 1$, then we can write

$$\frac{|\hat\sigma_i - \sigma_i|}{\sigma_i} \le \tau^{4n-2} - 1 = (4n-2)\varepsilon + O(\varepsilon^2).$$

Thus, the relative change in the singular values $|\hat\sigma_i - \sigma_i|/\sigma_i$ is bounded by
$4n-2$ times the relative change $\varepsilon$ in the matrix entries. With a little more
work, the factor $4n-2$ can be improved to $2n-1$ (see Question 5.21). The singular
vectors can also be shown to be determined quite accurately, proportional to
the reciprocal of the relative gap, as defined in section 5.2.1.
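
A small numerical check of the bound in Theorem 5.13 for $n = 2$ is sketched below. The closed-form helper `singular_values_2x2`, the sample entries, and the perturbation factors are assumptions for illustration; the helper is specific to 2-by-2 upper bidiagonal matrices.

```python
import math

def singular_values_2x2(a1, b1, a2):
    """Singular values (largest first) of the upper bidiagonal [[a1, b1], [0, a2]]."""
    trace = a1 * a1 + b1 * b1 + a2 * a2            # trace of B^T B
    det = (a1 * a2) ** 2                            # determinant of B^T B
    disc = max(trace * trace - 4.0 * det, 0.0)
    s_max = math.sqrt((trace + math.sqrt(disc)) / 2.0)
    s_min = abs(a1 * a2) / s_max                    # avoids cancellation
    return s_max, s_min

tau = 1.001                                         # every factor lies in [1/tau, tau]
a1, b1, a2 = 1.0, 2.0, 3e-3
chi1, chi2, zeta1 = tau, 1.0 / tau, tau

sigma = singular_values_2x2(a1, b1, a2)
sigma_hat = singular_values_2x2(a1 * chi1, b1 * zeta1, a2 * chi2)
rel_change = [abs(sh - s) / s for s, sh in zip(sigma, sigma_hat)]
bound = tau ** (4 * 2 - 2) - 1.0                    # tau^(4n-2) - 1 with n = 2
```

Both entries of `rel_change` should stay below `bound`, even though the smaller singular value is hundreds of times smaller than the larger one.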

We will show that both Bisection (Algorithm 5.4 applied to $T_{ps}$ from
Lemma 5.5) and dqds (Algorithm 5.11) can be used to find the singular values
of a bidiagonal matrix to high relative accuracy. First we consider Bisection.
Recall that the eigenvalues of the symmetric tridiagonal matrix $T_{ps}$ are the
singular values of $B$ and their negatives. Lemma 5.3 implies that the inertia of
$T_{ps} - \lambda I$ computed using equation (5.17) is the exact inertia of some $\hat B$, where
the relative difference of corresponding entries of $B$ and $\hat B$ is at most about
$2.5\varepsilon$. Therefore, by Theorem 5.13, the relative difference between the
computed singular values (the singular values of $\hat B$) and the true singular
values is at most about $(10n-5)\varepsilon$.

Now we consider Algorithm 5.11. We will use Theorem 5.13 to prove that
the singular values of $B$ (the input to Algorithm 5.11) and the singular values
of $\hat B$ (the output from Algorithm 5.11) agree to high relative accuracy. This
fact implies that after many steps of dqds, when $\hat B$ is nearly diagonal with its
singular values on the diagonal, these singular values match the singular values
of the original input matrix to high relative accuracy.

The simplest situation to understand is when the shift $\delta = 0$. In this case,
the only operations in dqds are additions of positive numbers, multiplications,
and divisions; no cancellation occurs. Roughly speaking, any sequence of
expressions built of these basic operations is guaranteed to compute each output
to high relative accuracy. Therefore, $\hat B$ is computed to high relative accuracy,
and so by Theorem 5.13, the singular values of $B$ and $\hat B$ agree to high relative
accuracy. The general case, where $\delta > 0$, is trickier [102].
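
To see the zero-shift case at work, the following sketch repeats dqds steps with $\delta = 0$ on a strongly graded 2-by-2 bidiagonal matrix and compares the result with closed-form singular values. The matrix entries, the step count of 5, and the variable names are assumptions chosen for illustration; with zero shift, convergence is fast here only because the two singular values differ greatly in size.

```python
import math

def dqds_step(q, e, delta=0.0):
    """One step of Algorithm 5.11 on q_j = a_j^2 and e_j = b_j^2."""
    n = len(q)
    q_hat, e_hat = [0.0] * n, [0.0] * (n - 1)
    d = q[0] - delta
    for j in range(n - 1):
        q_hat[j] = d + e[j]
        t = q[j + 1] / q_hat[j]
        e_hat[j] = e[j] * t
        d = d * t - delta
    q_hat[n - 1] = d
    return q_hat, e_hat

# Graded bidiagonal matrix B = [[1, 1], [0, 1e-8]] (illustrative values).
a, b = [1.0, 1e-8], [1.0]
q, e = [x * x for x in a], [x * x for x in b]
for _ in range(5):                       # zero-shift steps: no subtractions occur
    q, e = dqds_step(q, e)
sigma_dqds = [math.sqrt(x) for x in q]   # square roots taken only at the end

# Reference values from the characteristic polynomial of B^T B.
trace = a[0] ** 2 + b[0] ** 2 + a[1] ** 2
disc = trace * trace - 4.0 * (a[0] * a[1]) ** 2
s_max = math.sqrt((trace + math.sqrt(disc)) / 2.0)
s_min = abs(a[0] * a[1]) / s_max
```

The tiny singular value, about $7 \cdot 10^{-9}$, should match its reference value to nearly full relative precision, not merely to an absolute error near machine epsilon.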


THEOREM 5.14. One step of Algorithm 5.11 in floating point arithmetic,
applied to $B$ and yielding $\hat B$, is equivalent to the following sequence of
operations:


1. Make a small relative change (by at most $1.5\varepsilon$) in each entry of $B$,
getting $\tilde B$.


2. Apply one step of Algorithm 5.11 in exact arithmetic to $\tilde B$, getting $\check B$.


3. Make a small relative change (by at most $\varepsilon$) in each entry of $\check B$,
getting $\hat B$.


Steps 1 and 3 above make only small relative changes in the singular values
of the bidiagonal matrix, so by Theorem 5.13 the singular values of $B$ and $\hat B$
agree to high relative accuracy.


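Proof. Let us write the inner loop of Algorithm 5.11 as follows, introducing
subscripts on the $d$ and $t$ variables to let us keep track of them in different
iterations and including subscripted $1 + \varepsilon$ terms for the roundoff errors:

$$\begin{aligned}
\hat q_j &= (d_j + e_j)(1 + \varepsilon_{j,+}) \\
t_j &= (q_{j+1} / \hat q_j)(1 + \varepsilon_{j,/}) \\
\hat e_j &= e_j \cdot t_j (1 + \varepsilon_{j,*1}) \\
d_{j+1} &= (d_j \cdot t_j (1 + \varepsilon_{j,*2}) - \delta)(1 + \varepsilon_{j,-})
\end{aligned}$$


Substituting the first line into the second line yields


$$t_j = \frac{q_{j+1}}{d_j + e_j} \cdot \frac{1 + \varepsilon_{j,/}}{1 + \varepsilon_{j,+}}.$$

Substituting this expression for $t_j$ into the last line of the algorithm and
dividing through by $1 + \varepsilon_{j,-}$ yield


$$\frac{d_{j+1}}{1 + \varepsilon_{j,-}} = \frac{d_j q_{j+1}}{d_j + e_j} \cdot \frac{(1 + \varepsilon_{j,/})(1 + \varepsilon_{j,*2})}{1 + \varepsilon_{j,+}} - \delta. \tag{5.23}$$

This tells us how to define $\tilde B$: Let

$$\begin{aligned}
\tilde d_{j+1} &= \frac{d_{j+1}}{1 + \varepsilon_{j,-}}, \\
\tilde e_j &= \frac{e_j}{1 + \varepsilon_{j-1,-}}, \\
\tilde q_{j+1} &= q_{j+1} \frac{(1 + \varepsilon_{j,/})(1 + \varepsilon_{j,*2})}{1 + \varepsilon_{j,+}},
\end{aligned} \tag{5.24}$$

so (5.23) becomes

$$\tilde d_{j+1} = \frac{\tilde d_j \tilde q_{j+1}}{\tilde d_j + \tilde e_j} - \delta.$$


Note from (5.24) that $\tilde B$ differs from $B$ by a relative change of at most $1.5\varepsilon$ in
each entry (from the three $1 + \varepsilon$ factors in $\tilde q_{j+1} = \tilde a_{j+1}^2$).
Now we can define $\check q_j$ and $\check e_j$ in $\check B$ by

$$\begin{aligned}
\check q_j &= \tilde d_j + \tilde e_j, \\
\tilde t_j &= (\tilde q_{j+1} / \check q_j), \\
\check e_j &= \tilde e_j \cdot \tilde t_j, \\
\tilde d_{j+1} &= \tilde d_j \cdot \tilde t_j - \delta.
\end{aligned}$$

This is one step of the dqds algorithm applied exactly to $\tilde B$, getting $\check B$. To
finally show that $\check B$ differs from $\hat B$ by a relative change of at most $\varepsilon$ in each
entry, note that

$$\begin{aligned}
\check q_j &= \tilde d_j + \tilde e_j \\
&= \frac{d_j}{1 + \varepsilon_{j-1,-}} + \frac{e_j}{1 + \varepsilon_{j-1,-}} \\
&= (d_j + e_j)(1 + \varepsilon_{j,+}) \cdot \frac{1}{(1 + \varepsilon_{j,+})(1 + \varepsilon_{j-1,-})} \\
&= \hat q_j \cdot \frac{1}{(1 + \varepsilon_{j,+})(1 + \varepsilon_{j-1,-})}
\end{aligned}$$

and

$$\begin{aligned}
\check e_j &= \tilde e_j \cdot \tilde t_j \\
&= \frac{e_j}{1 + \varepsilon_{j-1,-}} \cdot \frac{\tilde q_{j+1}}{\check q_j} \\
&= \frac{e_j}{1 + \varepsilon_{j-1,-}} \cdot t_j (1 + \varepsilon_{j,*2})(1 + \varepsilon_{j-1,-}) \\
&= e_j t_j (1 + \varepsilon_{j,*1}) \cdot \frac{1 + \varepsilon_{j,*2}}{1 + \varepsilon_{j,*1}} \\
&= \hat e_j \cdot \frac{1 + \varepsilon_{j,*2}}{1 + \varepsilon_{j,*1}}.
\end{aligned}$$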

5.4.3. Jacobi’s Method for the SVD


In section 5.3.5 we discussed Jacobi’s method for finding the eigenvalues and 
eigenvectors of a dense symmetric matrix A, and said it was the slowest avail- 
able method for this problem. In this section we will show how to apply 
Jacobi’s method to find the SVD of a dense matrix G by implicitly applying 
Algorithm 5.8 of section 5.3.5 to the symmetric matrix $A = G^T G$. This implies
that the convergence properties of this method are nearly the same as those 
of Algorithm 5.8, and in particular Jacobi’s method is also the slowest method 
available for the SVD. 

Jacobi’s method is still interesting, however, because for some kinds of 
matrices G, it can compute the singular values and singular vectors much 
more accurately than the other algorithms we have discussed. For these G, 
Jacobi’s method computes the singular values and singular vectors to high 
relative accuracy, as described in section 5.2.1. 

After describing the implicit Jacobi method for the SVD of G, we will 
show that it computes the SVD to high relative accuracy when G can be 
written in the form G = DX, where D is diagonal and X is well conditioned. 
(This means that G is ill conditioned if and only if D has both large and small 
diagonal entries.) More generally, we benefit as long as X is significantly better 
conditioned than G. We will illustrate this with a matrix where any algorithm 
involving reduction to bidiagonal form necessarily loses all significant digits in 
all but the largest singular value, whereas Jacobi computes all singular values 
to full machine precision. Then we survey other classes of matrices G for 
which Jacobi’s method is also significantly more accurate than methods using 
bidiagonalization. 

Note that if G is bidiagonal, then we showed in section 5.4.2 that we could 
use either Bisection or the dqds algorithm (section 5.4.1) to compute its SVD 
to high relative accuracy. The trouble is that reducing a matrix from dense 
to bidiagonal form can introduce errors that are large enough to destroy high 
relative accuracy, as our example will show. Since Jacobi’s method operates on
the original matrix without first reducing it to bidiagonal form, it can achieve
high relative accuracy in many more situations.

The implicit Jacobi method is mathematically equivalent to applying
Algorithm 5.8 to $A = G^T G$. In other words, at each step we compute a Jacobi
rotation $J$ and implicitly update $G^T G$ to $J^T G^T G J$, where $J$ is chosen so that
two offdiagonal entries of $G^T G$ are set to zero in $J^T G^T G J$. But instead of
computing $G^T G$ or $J^T G^T G J$ explicitly, we only compute $GJ$. For this
reason, we call our algorithm one-sided Jacobi rotation.


ALGORITHM 5.12. Compute and apply a one-sided Jacobi rotation to $G$ in
coordinates $j, k$:


    proc One-Sided-Jacobi-Rotation(G, j, k)
        Compute $a_{jj} = (G^T G)_{jj}$, $a_{jk} = (G^T G)_{jk}$, and $a_{kk} = (G^T G)_{kk}$
        if $|a_{jk}|$ is not too small
            $\tau = (a_{jj} - a_{kk}) / (2 \cdot a_{jk})$
            $t = \operatorname{sign}(\tau) / (|\tau| + \sqrt{1 + \tau^2})$
            $c = 1 / \sqrt{1 + t^2}$
            $s = c \cdot t$
            $G = G \cdot R(j, k, \theta)$   ... where $c = \cos\theta$ and $s = \sin\theta$
            if right singular vectors are desired
                $J = J \cdot R(j, k, \theta)$
            end if
        end if


Note that the $jj$, $jk$, and $kk$ entries of $A = G^T G$ are computed by procedure
One-Sided-Jacobi-Rotation, after which it computes the Jacobi rotation
$R(j, k, \theta)$ in the same way as procedure Jacobi-Rotation (Algorithm 5.5).


ALGORITHM 5.13. One-sided Jacobi: Assume that $G$ is $n$-by-$n$. The outputs
are the singular values $\sigma_i$, the left singular vector matrix $U$, and the right
singular vector matrix $V$ so that $G = U \Sigma V^T$, where $\Sigma = \operatorname{diag}(\sigma_i)$.


    repeat
        for j = 1 to n-1
            for k = j+1 to n
                call One-Sided-Jacobi-Rotation(G, j, k)
            end for
        end for
    until $G^T G$ is diagonal enough
    Let $\sigma_i = \|G(:, i)\|_2$ (the 2-norm of column $i$ of $G$)
    Let $U = [u_1, \ldots, u_n]$, where $u_i = G(:, i)/\sigma_i$
    Let $V = J$, the accumulated product of Jacobi rotations

Question 5.22 asks for a proof that the matrices $\Sigma$, $U$, and $V$ computed by
one-sided Jacobi do indeed form the SVD of $G$.
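
Algorithms 5.12 and 5.13 can be sketched compactly in Python. This version returns singular values only and does not accumulate $U$ or $V$. The function name `one_sided_jacobi`, the tolerance default of $2^{-53}$, the cap of 30 sweeps, and both test matrices are assumptions made for this illustration; a column pair is skipped when $|a_{jk}| \le \mathrm{tol} \cdot \sqrt{a_{jj} a_{kk}}$.

```python
import math

def one_sided_jacobi(G, tol=2.0 ** -53, max_sweeps=30):
    """Singular values (unsorted) of the square matrix G, given as a list of rows."""
    n = len(G)
    G = [row[:] for row in G]                       # work on a copy
    for _ in range(max_sweeps):
        rotated = False
        for j in range(n - 1):
            for k in range(j + 1, n):
                ajj = sum(G[i][j] * G[i][j] for i in range(n))
                ajk = sum(G[i][j] * G[i][k] for i in range(n))
                akk = sum(G[i][k] * G[i][k] for i in range(n))
                if abs(ajk) <= tol * math.sqrt(ajj * akk):
                    continue                        # columns already orthogonal enough
                rotated = True
                tau = (ajj - akk) / (2.0 * ajk)
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(n):                  # G = G * R(j, k, theta)
                    gj, gk = G[i][j], G[i][k]
                    G[i][j] = c * gj + s * gk
                    G[i][k] = -s * gj + c * gk
        if not rotated:
            break
    return [math.sqrt(sum(G[i][j] ** 2 for i in range(n))) for j in range(n)]

# A well-scaled matrix with singular values sqrt(45) and sqrt(5).
sigma = sorted(one_sided_jacobi([[3.0, 0.0], [4.0, 5.0]]), reverse=True)

# A graded matrix G = D X with D = diag(1, 1e-20) and X = [[1, 1], [1, 2]].
graded = sorted(one_sided_jacobi([[1.0, 1.0], [1e-20, 2e-20]]), reverse=True)
```

For the graded matrix, the product of the two singular values equals $|\det G| = 10^{-20}$, so the small one can be checked independently as $10^{-20}$ divided by the large one.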

The following theorem shows that one-sided Jacobi can compute the SVD to 
high relative accuracy, despite roundoff, provided that we can write G = DX, 
where D is diagonal and X is well conditioned. 


THEOREM 5.15. Let $G = DX$ be an $n$-by-$n$ matrix, where $D$ is diagonal and
nonsingular, and $X$ is nonsingular. Let $\hat G$ be the matrix after calling
One-Sided-Jacobi-Rotation(G, j, k) $m$ times in floating point arithmetic. Let
$\sigma_1 \ge \cdots \ge \sigma_n$ be the singular values of $G$, and let $\hat\sigma_1 \ge \cdots \ge \hat\sigma_n$ be the
singular values of $\hat G$. Then

$$\frac{|\sigma_i - \hat\sigma_i|}{\sigma_i} \le O(m\varepsilon)\kappa(X), \tag{5.25}$$

where $\kappa(X) = \|X\| \cdot \|X^{-1}\|$ is the condition number of $X$. In other words, the
relative error in the singular values is small if the condition number of $X$ is
small.


Proof. We first consider $m = 1$; i.e., we apply only a single Jacobi rotation
and later generalize to larger $m$.

Examining One-Sided-Jacobi-Rotation(G, j, k), we see that
$\hat G = \mathrm{fl}(G \cdot \tilde R)$, where $\tilde R$ is a floating point Givens rotation. By
construction, $\tilde R$ differs from some exact Givens rotation $R$ by $O(\varepsilon)$ in norm.
(It is not important or necessarily true that $\tilde R$ differs by $O(\varepsilon)$ from the
“true” Jacobi rotation, the one that One-Sided-Jacobi-Rotation(G, j, k) would
have computed in exact arithmetic. It is necessary only that it differs from
some rotation by $O(\varepsilon)$. This requires only that $c^2 + s^2 = 1 + O(\varepsilon)$, which is
easy to verify.)

Our goal is to show that $\hat G = GR(I + E)$ for some $E$ that is small in
norm: $\|E\|_2 = O(\varepsilon)\kappa(X)$. If $E$ were zero, then $\hat G$ and $GR$ would have the
same singular values, since $R$ is exactly orthogonal. When $E$ is less than one
in norm, we can use Corollary 5.2 to bound the relative difference in singular
values by

$$\frac{|\sigma_i - \hat\sigma_i|}{\sigma_i} \le \|(I + E)(I + E)^T - I\|_2 = \|E + E^T + E E^T\|_2 \le 3\|E\|_2 = O(\varepsilon)\kappa(X) \tag{5.26}$$

as desired.

Now we construct $E$. Since $\tilde R$ multiplies $G$ on the right, each row of $\hat G$
depends only on the corresponding row of $G$; write this in Matlab notation as
$\hat G(i,:) = \mathrm{fl}(G(i,:) \cdot \tilde R)$. Let $F = \hat G - GR$. Then by Lemma 3.1 and the fact
that $G = DX$,

$$\|F(i,:)\|_2 = \|\hat G(i,:) - G(i,:)R\|_2 = O(\varepsilon)\|G(i,:)\|_2 = O(\varepsilon)\|d_{ii} X(i,:)\|_2$$

and so $\|d_{ii}^{-1} F(i,:)\|_2 = O(\varepsilon)\|X(i,:)\|_2$, or $\|D^{-1} F\|_2 = O(\varepsilon)\|X\|_2$. Therefore,
since $R^{-1} = R^T$ and $G^{-1} = (DX)^{-1} = X^{-1} D^{-1}$,

$$\hat G = GR + F = GR(I + R^T G^{-1} F) = GR(I + R^T X^{-1} D^{-1} F) \equiv GR(I + E),$$

where

$$\|E\|_2 \le \|R^T\|_2 \|X^{-1}\|_2 \|D^{-1} F\|_2 = O(\varepsilon)\|X\|_2 \|X^{-1}\|_2 = O(\varepsilon)\kappa(X)$$

as desired.

To extend this result to $m > 1$ rotations, note that in exact arithmetic
we would have $\hat G = GR = DXR = D\hat X$, with $\kappa(\hat X) = \kappa(X)$, so that the
bound (5.26) would apply at each of the $m$ steps, yielding bound (5.25).
Because of roundoff, $\kappa(\hat X)$ could grow by as much as
$\kappa(I + E) \le (1 + O(\varepsilon)\kappa(X))$ at each step, a factor very close to 1, which we
absorb into the $O(m\varepsilon)$ term.

To complete the algorithm, we need to be careful about the stopping
criterion, i.e., how to implement the statement “if $|a_{jk}|$ is not too small” in
Algorithm 5.12, One-Sided-Jacobi-Rotation. The appropriate criterion

$$|a_{jk}| \ge \varepsilon \sqrt{a_{jj} a_{kk}}$$

is discussed further in Question 5.24.


EXAMPLE 5.9. We consider an extreme example $G = DX$ where Jacobi’s
method computes all singular values to full machine precision; any method
relying on bidiagonalization computes only the largest one, $\sqrt{3}$, to full machine
precision; and all the others with no accuracy at all (although it still computes
them with errors $\pm O(\varepsilon) \cdot \sqrt{3}$, as expected from a backward stable algorithm).
In this example $\varepsilon = 2^{-53} \approx 10^{-16}$ (IEEE double precision) and $\eta = 10^{-20}$ (any
value of $\eta < \varepsilon$ will do). We define

$$G = \begin{pmatrix} \eta & 1 & 1 & 1 \\ \eta & \eta & 0 & 0 \\ \eta & 0 & \eta & 0 \\ \eta & 0 & 0 & \eta \end{pmatrix}
= \begin{pmatrix} 1 & & & \\ & \eta & & \\ & & \eta & \\ & & & \eta \end{pmatrix}
\cdot \begin{pmatrix} \eta & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix} \equiv D \cdot X.$$

To at least 16 digits, the singular values of $G$ are $\sqrt{3}$, $\sqrt{3} \cdot \eta$, $\eta$, and $\eta$. To
see how accuracy is lost by reducing $G$ to bidiagonal form, we consider just
the first step of the algorithm in section 4.4.7: After step 1, premultiplication
by a Householder transformation to zero out $G(2:4, 1)$, $G$ in exact arithmetic
would be

$$\begin{pmatrix}
-2\eta & -.5 - \frac{\eta}{2} & -.5 - \frac{\eta}{2} & -.5 - \frac{\eta}{2} \\
0 & -.5 + \frac{5\eta}{6} & -.5 - \frac{\eta}{6} & -.5 - \frac{\eta}{6} \\
0 & -.5 - \frac{\eta}{6} & -.5 + \frac{5\eta}{6} & -.5 - \frac{\eta}{6} \\
0 & -.5 - \frac{\eta}{6} & -.5 - \frac{\eta}{6} & -.5 + \frac{5\eta}{6}
\end{pmatrix},$$

but since $\eta$ is so small, this rounds to

$$G_1 = \begin{pmatrix}
-2\eta & -.5 & -.5 & -.5 \\
0 & -.5 & -.5 & -.5 \\
0 & -.5 & -.5 & -.5 \\
0 & -.5 & -.5 & -.5
\end{pmatrix}.$$

Note that all information about $\eta$ has been “lost” from the last three columns
of $G_1$. Since the last three columns of $G_1$ are identical, $G_1$ is exactly singular
and indeed of rank 2. Thus the two smallest singular values have been changed
from $\eta$ to 0, a complete loss of relative accuracy. If we made no further rounding
errors, we would reduce $G_1$ to a bidiagonal form $B$
with singular values $\sqrt{3}$, $\sqrt{3}\eta$, 0, and 0, the larger two of which are accurate
singular values of $G$. But as the algorithm proceeds to reduce $G_1$ to bidiagonal
form, roundoff introduces nonzero quantities of $O(\varepsilon)$ into the zero entries of $B$,
making all three small singular values inaccurate. The two smallest nonzero
computed singular values are accidents of roundoff and proportional to $\varepsilon$.

One-sided Jacobi’s method has no difficulty with this matrix, converging in
three sweeps to $G = U \Sigma V^T$, where to machine precision


2 1 St 

Cee FFR 

TE v 
gee eae ei ea 
|e Py S| gy eee 

1 7 mn 2 o a Le 

Z V 3 v V2 VB v6 


and $\Sigma = \operatorname{diag}(\sqrt{3}\eta, \eta, \sqrt{3}, \eta)$. (Jacobi does not automatically sort the
singular values; this can be done as a postprocessing step.) ◊
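
The singular values quoted in Example 5.9 can be confirmed by hand from $G^T G$. The following short derivation is an added check, with $e_i$ denoting the $i$th unit vector and $u = (e_2 + e_3 + e_4)/\sqrt{3}$ introduced here for convenience. First,

$$G^T G = \begin{pmatrix}
4\eta^2 & \eta + \eta^2 & \eta + \eta^2 & \eta + \eta^2 \\
\eta + \eta^2 & 1 + \eta^2 & 1 & 1 \\
\eta + \eta^2 & 1 & 1 + \eta^2 & 1 \\
\eta + \eta^2 & 1 & 1 & 1 + \eta^2
\end{pmatrix},$$

so $G^T G (e_2 - e_3) = \eta^2 (e_2 - e_3)$ and $G^T G (e_3 - e_4) = \eta^2 (e_3 - e_4)$, which gives the singular value $\eta$ twice. On the invariant subspace spanned by $e_1$ and $u$, $G^T G$ acts as

$$M = \begin{pmatrix} 4\eta^2 & \sqrt{3}(\eta + \eta^2) \\ \sqrt{3}(\eta + \eta^2) & 3 + \eta^2 \end{pmatrix},
\qquad \operatorname{tr} M = 3 + 5\eta^2, \qquad \det M = \eta^2 (3 - \eta)^2.$$

Hence the remaining two singular values satisfy $\sigma_1^2 + \sigma_2^2 = 3 + 5\eta^2$ and $\sigma_1 \sigma_2 = \eta(3 - \eta)$, so $\sigma_1 = \sqrt{3}\,(1 + O(\eta^2))$ and $\sigma_2 = \sqrt{3}\,\eta\,(1 + O(\eta))$. With $\eta = 10^{-20}$ these agree with $\sqrt{3}$ and $\sqrt{3}\eta$ to far more than 16 digits.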


Here are some other examples where versions of Jacobi’s method can be 
shown to guarantee high relative accuracy in the SVD (or symmetric eigen- 
decomposition), whereas methods relying on bidiagonalization (or tridiago- 
nalization) may lose all significant digits in the smallest singular value (or 
eigenvalues). Many other examples appear in [74]. 


1. If $A = LL^T$ is the Cholesky decomposition of a symmetric positive
definite matrix, then the SVD of $L = U \Sigma V^T$ provides the
eigendecomposition of $A = U \Sigma^2 U^T$. If $L = DX$, where $X$ is well-conditioned
and $D$ is diagonal, then Theorem 5.15 tells us that we can use Jacobi’s method
to compute the singular values $\sigma_i$ of $L$ to high relative accuracy, with
relative errors bounded by $O(\varepsilon)\kappa(X)$. But we also have to account for
the roundoff errors in computing the Cholesky factor $L$: using Cholesky’s
backward error bound (2.16) (along with Theorem 5.6) one can bound
the relative error in the singular values introduced by roundoff during
Cholesky by $O(\varepsilon)\kappa^2(X)$. So if $X$ is well-conditioned, all the eigenvalues
of $A$ will be computed to high relative accuracy (see Question 5.23 and
[81, 90, 181]).


   EXAMPLE 5.10. As in Example 5.9, we choose an extreme case where
   any algorithm relying on initially reducing $A$ to tridiagonal form is
   guaranteed to lose all relative accuracy in the smallest eigenvalue, whereas
   Cholesky followed by one-sided Jacobi’s method on the Cholesky factor
   computes all eigenvalues to nearly full machine precision. As in that
   example, let $\eta = 10^{-20}$ (any $\eta < \varepsilon/120$ will do), and let

   $$A = \begin{pmatrix} 1 & \sqrt{\eta} & \sqrt{\eta} \\ \sqrt{\eta} & 1 & 10\eta \\ \sqrt{\eta} & 10\eta & 100\eta \end{pmatrix}
   = \begin{pmatrix} 1 & 10^{-10} & 10^{-10} \\ 10^{-10} & 1 & 10^{-19} \\ 10^{-10} & 10^{-19} & 10^{-18} \end{pmatrix}.$$

   If we reduce $A$ to tridiagonal form $T$ exactly, then

   $$T = \begin{pmatrix} 1 & \sqrt{2\eta} & \\ \sqrt{2\eta} & .5 + 60\eta & .5 - 50\eta \\ & .5 - 50\eta & .5 + 40\eta \end{pmatrix},$$

   but since $\eta$ is so small, this rounds to

   $$\hat T = \begin{pmatrix} 1 & \sqrt{2\eta} & \\ \sqrt{2\eta} & .5 & .5 \\ & .5 & .5 \end{pmatrix},$$

   which is not even positive definite, since the bottom right 2-by-2
   submatrix is exactly singular. Thus, the smallest eigenvalue of $\hat T$ is
   nonpositive, and so tridiagonal reduction has lost all relative accuracy in
   the smallest eigenvalue. In contrast, one-sided Jacobi’s method has no
   trouble computing the correct square roots of eigenvalues of $A$, namely,
   $1 + \sqrt{\eta} = 1 + 10^{-10}$, $1 - \sqrt{\eta} = 1 - 10^{-10}$, and $99\eta = .99 \cdot 10^{-18}$, to nearly
   full machine precision. ◊


2. For extensions of the preceding result to indefinite symmetric
eigenproblems, see [226, 248].


3. For extensions to the generalized symmetric eigenproblem $A - \lambda B$ and
the generalized SVD, see [65, 90].


5.5. Differential Equations and Eigenvalue Problems


We seek our motivation for this section from conservation laws in physics. We
consider once again the mass-spring system introduced in Example 4.1 and
reexamined in Example 5.1. We start with the simplest case of one spring and
one mass, without friction.

We let $x$ denote horizontal displacement from equilibrium. Then Newton’s
law $F = ma$ becomes $m\ddot x(t) + kx(t) = 0$. Let
$E(t) = \frac{1}{2} m \dot x^2(t) + \frac{1}{2} k x^2(t)$ = “kinetic energy” + “potential energy.”
Conservation of energy tells us that $\frac{d}{dt} E(t)$ should be zero. We can confirm
this is true by computing
$\frac{d}{dt} E(t) = m \dot x(t) \ddot x(t) + k \dot x(t) x(t) = \dot x(t)(m \ddot x(t) + k x(t)) = 0$ as desired.

More generally we have $M \ddot x(t) + K x(t) = 0$, where $M$ is the mass matrix and
$K$ is the stiffness matrix. The energy is defined to be
$E(t) = \frac{1}{2} \dot x^T(t) M \dot x(t) + \frac{1}{2} x^T(t) K x(t)$. That this is the correct definition is
confirmed by verifying that it is conserved:

$$\begin{aligned}
\frac{d}{dt} E(t) &= \frac{d}{dt} \left( \frac{1}{2} \dot x^T(t) M \dot x(t) + \frac{1}{2} x^T(t) K x(t) \right) \\
&= \frac{1}{2} \left( \ddot x^T(t) M \dot x(t) + \dot x^T(t) M \ddot x(t) + \dot x^T(t) K x(t) + x^T(t) K \dot x(t) \right) \\
&= \dot x^T(t) M \ddot x(t) + \dot x^T(t) K x(t) \\
&= \dot x^T(t) (M \ddot x(t) + K x(t)) = 0,
\end{aligned}$$

where we have used the symmetry of $M$ and $K$.

The differential equations $M \ddot x(t) + K x(t) = 0$ are linear. It is a remarkable
fact that some nonlinear differential equations also conserve quantities such as
“energy.”


5.5.1. The Toda Lattice 


For ease of notation, we will write $x$ instead of $x(t)$ when the argument is
clear from context.

The Toda lattice is also a mass-spring system, but the force from the spring
is an exponentially decaying function of its stretch, instead of a linear function:

$$\ddot x_i = e^{-(x_i - x_{i-1})} - e^{-(x_{i+1} - x_i)}.$$

We use the boundary conditions $e^{-(x_1 - x_0)} = 0$ (i.e., $x_0 = -\infty$) and
$e^{-(x_{n+1} - x_n)} = 0$ (i.e., $x_{n+1} = +\infty$). More simply, these boundary conditions
mean there are no walls at the left or right (see Figure 4.1).

Now we change variables to $b_k = \frac{1}{2} e^{(x_k - x_{k+1})/2}$ and $a_k = -\frac{1}{2} \dot x_k$. This yields
the differential equations

$$\begin{aligned}
\dot b_k &= \frac{1}{2} e^{(x_k - x_{k+1})/2} \cdot \frac{1}{2} (\dot x_k - \dot x_{k+1}) = b_k (a_{k+1} - a_k), \\
\dot a_k &= -\frac{1}{2} \ddot x_k = 2 (b_k^2 - b_{k-1}^2)
\end{aligned} \tag{5.27}$$

with $b_0 \equiv 0$ and $b_n \equiv 0$. Now define the two tridiagonal matrices

$$T = \begin{pmatrix} a_1 & b_1 & & \\ b_1 & \ddots & \ddots & \\ & \ddots & \ddots & b_{n-1} \\ & & b_{n-1} & a_n \end{pmatrix}
\quad\text{and}\quad
B = \begin{pmatrix} 0 & b_1 & & \\ -b_1 & \ddots & \ddots & \\ & \ddots & \ddots & b_{n-1} \\ & & -b_{n-1} & 0 \end{pmatrix},$$

where $B = -B^T$. Then one can easily confirm that equation (5.27) is the same
as $\frac{dT}{dt} = BT - TB$. This is called the Toda flow.


THEOREM 5.16. $T(t)$ has the same eigenvalues as $T(0)$ for all $t$. In other
words, the eigenvalues, such as “energy,” are conserved by the differential
equation.


Proof. Define $\frac{d}{dt} U = BU$, $U(0) = I$. We claim that $U(t)$ is orthogonal for all
$t$. To prove this, it suffices to show $\frac{d}{dt} U^T U = 0$ since $U^T U(0) = I$:

$$\frac{d}{dt} U^T U = \dot U^T U + U^T \dot U = U^T B^T U + U^T B U = -U^T B U + U^T B U = 0$$

since $B$ is skew symmetric.
Now we claim that $T(t) = U(t) T(0) U^T(t)$ satisfies the Toda flow
$\frac{dT}{dt} = BT - TB$, implying each $T(t)$ is orthogonally similar to $T(0)$ and so has
the same eigenvalues:

$$\begin{aligned}
\frac{d}{dt} T(t) &= \dot U(t) T(0) U^T(t) + U(t) T(0) \dot U^T(t) \\
&= B(t) U(t) T(0) U^T(t) + U(t) T(0) U^T(t) B^T(t) \\
&= B(t) T(t) - T(t) B(t)
\end{aligned}$$

as desired.
Note that the only property of $B$ used was skew symmetry, so if
$\frac{d}{dt} T = BT - TB$ and $B^T = -B$, then $T(t)$ has the same eigenvalues for all $t$.


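THEOREM 5.17. As $t \to +\infty$ or $t \to -\infty$, $T(t)$ converges to a diagonal matrix
with the eigenvalues on the diagonal.


Proof. We want to show $b_i(t) \to 0$ as $t \to \pm\infty$. We begin by showing
$\int_{-\infty}^{\infty} \sum_{i=1}^{n-1} b_i^2(t)\,dt < \infty$. We use induction to show
$\int_{-\infty}^{\infty} (b_j^2(t) + b_{n-j}^2(t))\,dt < \infty$ and then add these inequalities for all $j$.
When $j = 0$, we get $\int_{-\infty}^{\infty} (b_0^2(t) + b_n^2(t))\,dt$, which is 0 by assumption.

Now let $\varphi(t) = a_j(t) - a_{n-j+1}(t)$. $\varphi(t)$ is bounded by $2\|T(t)\|_2 = 2\|T(0)\|_2$
for all $t$. Then

$$\begin{aligned}
\dot\varphi(t) &= \dot a_j(t) - \dot a_{n-j+1}(t) \\
&= 2(b_j^2(t) - b_{j-1}^2(t)) - 2(b_{n-j+1}^2(t) - b_{n-j}^2(t)) \\
&= 2(b_j^2(t) + b_{n-j}^2(t)) - 2(b_{j-1}^2(t) + b_{n-j+1}^2(t))
\end{aligned}$$

and so

$$\begin{aligned}
\varphi(\tau) - \varphi(-\tau) &= \int_{-\tau}^{\tau} \dot\varphi(t)\,dt \\
&= 2 \int_{-\tau}^{\tau} (b_j^2(t) + b_{n-j}^2(t))\,dt - 2 \int_{-\tau}^{\tau} (b_{j-1}^2(t) + b_{n-j+1}^2(t))\,dt.
\end{aligned}$$

The last integral is bounded for all $\tau$ by the induction hypothesis, and
$\varphi(\tau) - \varphi(-\tau)$ is also bounded for all $\tau$, so $\int_{-\tau}^{\tau} (b_j^2(t) + b_{n-j}^2(t))\,dt$ must be
bounded as desired.

Let $p(t) = \sum_{i=1}^{n-1} b_i^2(t)$. We now know that $\int_{-\infty}^{\infty} p(t)\,dt < \infty$, and since
$p(t) \ge 0$ we want to conclude that $\lim_{t \to \pm\infty} p(t) = 0$. But we need to exclude
the possibility that $p(t)$ has narrow spikes as $t \to \pm\infty$, in which case
$\int_{-\infty}^{\infty} p(t)\,dt$ could be finite without $p(t)$ approaching 0. We show $p(t)$ has no
spikes by showing its derivative is bounded:

$$|\dot p(t)| = \left| \sum_{i=1}^{n-1} 2 b_i(t) \dot b_i(t) \right|
= \left| \sum_{i=1}^{n-1} 2 b_i^2(t) (a_{i+1}(t) - a_i(t)) \right| \le 4(n-1)\|T\|_2^3.$$


Thus, in principle, one could use an ODE solver on the Toda flow to solve
the eigenvalue problem, but this is no faster than other existing methods. The
interest in the Toda flow lies in its close relationship with the QR algorithm.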
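
The behavior described by Theorems 5.16 and 5.17 can be observed by integrating equations (5.27) numerically. The sketch below uses a classical fourth-order Runge-Kutta scheme; the function names `toda_rhs` and `toda_flow`, the step counts, and the starting matrices are assumptions for illustration, and a fixed-step integrator is used only for brevity.

```python
def toda_rhs(a, b):
    """Right-hand side of (5.27) for diagonal a (length n), offdiagonal b (length n-1)."""
    n = len(a)
    bb = [0.0] + list(b) + [0.0]                    # pad with b_0 = b_n = 0
    da = [2.0 * (bb[k + 1] ** 2 - bb[k] ** 2) for k in range(n)]
    db = [b[k] * (a[k + 1] - a[k]) for k in range(n - 1)]
    return da, db

def toda_flow(a, b, t_end, steps):
    """Integrate the Toda flow from t = 0 to t_end with classical RK4."""
    h = t_end / steps
    a, b = list(a), list(b)

    def shifted(y, scale, dy):
        return [yi + scale * di for yi, di in zip(y, dy)]

    for _ in range(steps):
        k1a, k1b = toda_rhs(a, b)
        k2a, k2b = toda_rhs(shifted(a, h / 2, k1a), shifted(b, h / 2, k1b))
        k3a, k3b = toda_rhs(shifted(a, h / 2, k2a), shifted(b, h / 2, k2b))
        k4a, k4b = toda_rhs(shifted(a, h, k3a), shifted(b, h, k3b))
        a = [y + h / 6 * (p + 2 * q + 2 * r + s)
             for y, p, q, r, s in zip(a, k1a, k2a, k3a, k4a)]
        b = [y + h / 6 * (p + 2 * q + 2 * r + s)
             for y, p, q, r, s in zip(b, k1b, k2b, k3b, k4b)]
    return a, b

# T(0) = [[2, 1], [1, 2]] has eigenvalues 3 and 1.
a2, b2 = toda_flow([2.0, 2.0], [1.0], t_end=10.0, steps=1000)

# A 3-by-3 start: trace(T) and trace(T^2) are functions of the eigenvalues only.
a3, b3 = toda_flow([1.0, 2.0, 3.0], [1.0, 1.0], t_end=10.0, steps=2000)
trace_T = sum(a3)
trace_T2 = sum(x * x for x in a3) + 2.0 * sum(x * x for x in b3)
```

For the 2-by-2 start one can check by substitution that $a_1 - a_2 = 2\tanh(2t)$ and $b_1 = \operatorname{sech}(2t)$ solve (5.27), so by $t = 10$ the offdiagonal entry is about $4 \cdot 10^{-9}$ and the diagonal holds the eigenvalues 3 and 1.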


DEFINITION 5.5. Let $X_-$ denote the strictly lower triangle of $X$, and
$\pi_0(X) = X_- - X_-^T$.


Note that $\pi_0(X)$ is skew symmetric and that if $X$ is already skew symmetric,
then $\pi_0(X) = X$. Thus $\pi_0$ projects onto skew symmetric matrices.
Consider the differential equation

$$\frac{d}{dt} T = BT - TB, \tag{5.28}$$

where $B = -\pi_0(F(T))$ and $F$ is any smooth function from the real numbers
to the real numbers. Since $B = -B^T$, Theorem 5.16 shows that $T(t)$ has the
same eigenvalues for all $t$. Choosing $F(x) = x$ corresponds to the Toda flow
that we just studied, since in this case

$$-\pi_0(F(T)) = -\pi_0(T) = \begin{pmatrix} 0 & b_1 & & \\ -b_1 & \ddots & \ddots & \\ & \ddots & \ddots & b_{n-1} \\ & & -b_{n-1} & 0 \end{pmatrix} = B.$$


The next theorem relates the QR decomposition to the solution of differential
equation (5.28).
THEOREM 5.18. Let $F(T(0)) = F_0$. Let $e^{tF_0} = Q(t)R(t)$ be the QR
decomposition. Then $T(t) = Q^T(t) T(0) Q(t)$ solves equation (5.28).


We delay the proof of the theorem until later. If we choose the function F cor- 
rectly, it turns out that the iterates computed by QR iteration (Algorithm 4.4) 
are identical to the solutions of the differential equation. 


DEFINITION 5.6. Choosing F(x) = log x in equation (5.28) yields a differential 
equation called the QR flow. 


COROLLARY 5.3. Let $F(x) = \log x$. Suppose that $T(0)$ is positive definite, so
$\log T(0)$ is real. Let $T_0 \equiv T(0) = QR$, $T_1 = RQ$, etc. be the sequence of
matrices produced by the unshifted QR iteration. Then $T(i) = T_i$. Thus the
QR algorithm gives solutions to the QR flow at integer times $t$.²²


Proof of Corollary. At $t = 1$, we get $e^{\log T_0} = T_0 = Q(1)R(1)$, the QR
decomposition of $T_0$, and $T(1) = Q^T(1) T_0 Q(1) = R(1)Q(1) = T_1$ as desired.
Since the solution of the ODE is unique, this extends to show $T(i) = T_i$ for
larger $i$.

The following figure illustrates this corollary graphically. The curve
represents the solution of the differential equation. The dots represent the
solutions $T(i)$ at the integer times $t = 0, 1, 2, \ldots$ and indicate that they are
equal to the QR iterates $T_i$.


[Figure: the solution curve $T(t)$, with dots marking $T(0) = T_0$, $T(1) = T_1$, $T(2) = T_2$, ...]


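Proof of Theorem 5.18. Differentiate $e^{tF_0} = QR$ to get

$$\begin{aligned}
F_0 e^{tF_0} &= \dot Q R + Q \dot R \\
\text{or}\quad \dot Q &= F_0 e^{tF_0} R^{-1} - Q \dot R R^{-1} \\
\text{or}\quad Q^T \dot Q &= Q^T F_0 e^{tF_0} R^{-1} - \dot R R^{-1} \\
&= Q^T F_0 (QR) R^{-1} - \dot R R^{-1} && \text{because } e^{tF_0} = QR \\
&= Q^T F(T(0)) Q - \dot R R^{-1} && \text{because } F_0 = F(T(0)) \\
&= F(Q^T T(0) Q) - \dot R R^{-1} \\
&= F(T) - \dot R R^{-1}.
\end{aligned}$$

Now $I = Q^T Q$ implies that
$0 = \frac{d}{dt} Q^T Q = \dot Q^T Q + Q^T \dot Q = (Q^T \dot Q)^T + (Q^T \dot Q)$.
This means $Q^T \dot Q$ is skew symmetric, and so
$\pi_0(Q^T \dot Q) = Q^T \dot Q = \pi_0(F(T) - \dot R R^{-1})$. Since $\dot R R^{-1}$ is upper triangular,
it doesn’t affect $\pi_0$ and so finally $Q^T \dot Q = \pi_0(F(T))$. Now

$$\begin{aligned}
\frac{d}{dt} T(t) &= \frac{d}{dt} Q^T(t) T(0) Q(t) \\
&= \dot Q^T T(0) Q + Q^T T(0) \dot Q \\
&= \dot Q^T (Q Q^T) T(0) Q + Q^T T(0) (Q Q^T) \dot Q \\
&= \dot Q^T Q\, T(t) + T(t)\, Q^T \dot Q \\
&= -Q^T \dot Q\, T(t) + T(t)\, Q^T \dot Q \\
&= -\pi_0(F(T(t)))\, T(t) + T(t)\, \pi_0(F(T(t)))
\end{aligned}$$

as desired.


²² Note that since the QR decomposition is not completely unique ($Q$ can be replaced by
$QS$ and $R$ can be replaced by $SR$, where $S$ is a diagonal matrix with diagonal entries $\pm 1$),
$T_i$ and $T(i)$ could actually differ by a similarity $T_i = S T(i) S^{-1}$. For simplicity we will
assume here, and in Corollary 5.4, that $S$ has been chosen so that $T_i = T(i)$.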

The next corollary explains the phenomenon observed in Question 4.15, 
where QR could be made to “run backward” and return to its starting matrix. 
See also Question 5.25. 


COROLLARY 5.4. Suppose that we obtain $T_4$ from the positive definite matrix
$T_0$ by the following steps:


1. Do $m$ steps of the unshifted QR algorithm on $T_0$ to get $T_1$.


2. Let $T_2$ = “flipped $T_1$” = $J T_1 J$, where $J$ equals the identity matrix with
its columns in reverse order.


3. Do $m$ steps of unshifted QR on $T_2$ to get $T_3$.


4. Let $T_4 = J T_3 J$.


Then $T_4 = T_0$.
Proof. If $X = X^T$, it is easy to verify that $\pi_0(JXJ) = -J \pi_0(X) J$ so
$T_J(t) \equiv J T(t) J$ satisfies

$$\begin{aligned}
\frac{d}{dt} T_J(t) &= J \frac{d}{dt} T(t) J \\
&= J \left[ -\pi_0(F(T)) T + T \pi_0(F(T)) \right] J \\
&= -J \pi_0(F(T)) J (J T J) + (J T J) J \pi_0(F(T)) J && \text{since } J^2 = I \\
&= \pi_0(J F(T) J) T_J - T_J \pi_0(J F(T) J) \\
&= \pi_0(F(J T J)) T_J - T_J \pi_0(F(J T J)) \\
&= \pi_0(F(T_J)) T_J - T_J \pi_0(F(T_J)).
\end{aligned}$$

This is nearly the same equation as $T(t)$. In fact, it satisfies exactly the same
equation as $T(-t)$:

$$\frac{d}{dt} T(-t) = -\left. \frac{d}{dt} T \right|_{-t} = -\left[ -\pi_0(F(T)) T + T \pi_0(F(T)) \right]_{-t}.$$

So with the same initial conditions $T_2$, $T_J(t)$ and $T(-t)$ must be equal.
Integrating for time $m$, $T(-t)$ takes $T_2 = J T_1 J$ back to $J T_0 J$, the initial state,
so $T_3 = J T_0 J$ and $T_4 = J T_3 J = T_0$ as desired.


5.5.2. The Connection to Partial Differential Equations


This section may be skipped on a first reading.

Let $T(t) = -\frac{\partial^2}{\partial x^2} + q(x,t)$ and
$B(t) = -4 \frac{\partial^3}{\partial x^3} + 3 \left( q(x,t) \frac{\partial}{\partial x} + \frac{\partial}{\partial x} q(x,t) \right)$.
Both $T(t)$ and $B(t)$ are linear operators on functions, i.e., generalizations of
matrices.
Substituting into $\frac{dT}{dt} = BT - TB$ yields

$$q_t = 6 q q_x - q_{xxx}, \tag{5.29}$$

provided that we choose the correct boundary conditions for $q$. ($B$ must be
skew symmetric and $T$ symmetric.) Equation (5.29) is called the Korteweg-de
Vries equation and describes water flow in a shallow channel. One can
rigorously show that (5.29) preserves the eigenvalues of $T(t)$ for all $t$ in the
sense that the ODE

$$\left( -\frac{\partial^2}{\partial x^2} + q(x,t) \right) h(x) = \lambda h(x)$$

has some infinite set of eigenvalues $\lambda_1, \lambda_2, \ldots$ for all $t$. In other words, there
is an infinite sequence of energylike quantities conserved by the Korteweg-de
Vries equation. This is important for both theoretical and numerical reasons.


For more details on the Toda flow, see [142, 168, 66, 67, 237] and papers 
by Kruskal [164], Flaschka [104], and Moser [185] in [186]. 


5.6. References and Other Topics for Chapter 5 


An excellent general reference for the symmetric eigenproblem is [195]. The 
material on relative perturbation theory can be found in [74, 81, 99];
section 5.2.1 was based on the last of these references. Related work is found
in [65, 90, 226, 248]. A classical text on perturbation theory for general linear
operators is [159]. For a survey of parallel algorithms for the symmetric eigen- 
problem, see [75]. The QR algorithm for finding the SVD of bidiagonal matrices 
is discussed in [79, 66, 118], and the dqds algorithm is in [102, 198, 207]. For 
an error analysis of the Bisection algorithm, see [72, 73, 154], and for recent 
attempts to accelerate Bisection see [103, 201, 199, 174, 171, 173, 267]. Current 
work in improving inverse iteration appears in [103, 199, 201]. The divide-and- 
conquer eigenroutine was introduced in [58] and further developed in [13, 88, 
125, 129, 151, 170, 208, 232]. The possibility of high-accuracy eigenvalues ob- 
tained from Jacobi is discussed in [65, 74, 81, 90, 181, 226]. The Toda flow and 
related phenomena are discussed in [66, 67, 104, 142, 164, 168, 185, 186, 237]. 
5.7. Questions for Chapter 5


QUESTION 5.1. (Easy; Z. Bai) Show that A = B + iC is Hermitian if and only
if

$$ M = \begin{bmatrix} B & -C \\ C & B \end{bmatrix} $$

is symmetric. Express the eigenvalues and eigenvectors of M in terms of those
of A.


QUESTION 5.2. (Medium) Prove Corollary 5.1, using Weyl’s theorem (Theo- 
rem 5.1) and part 4 of Theorem 3.3. 


QUESTION 5.3. (Medium) Consider Figure 5.1. Consider the corresponding
contour plot for an arbitrary 3-by-3 matrix A with eigenvalues $\alpha_3 < \alpha_2 < \alpha_1$.
Let $C_1$ and $C_2$ be the two great circles along which $\rho(u, A) = \alpha_2$. At what
angle do they intersect?


QUESTION 5.4. (Hard) Use the Courant-Fischer minimax theorem (Theorem
5.2) to prove the Cauchy interlace theorem:


• Suppose that $A = \begin{bmatrix} H & b \\ b^T & u \end{bmatrix}$ is an n-by-n symmetric matrix and H is
(n − 1)-by-(n − 1). Let $\alpha_n \le \cdots \le \alpha_1$ be the eigenvalues of A and
$\theta_{n-1} \le \cdots \le \theta_1$ be the eigenvalues of H. Show that these two sets of
eigenvalues interlace:

$$ \alpha_n \le \theta_{n-1} \le \cdots \le \theta_i \le \alpha_i \le \theta_{i-1} \le \alpha_{i-1} \le \cdots \le \theta_1 \le \alpha_1. $$


• Let $A = \begin{bmatrix} H & B \\ B^T & U \end{bmatrix}$ be n-by-n and H be m-by-m, with eigenvalues
$\theta_m \le \cdots \le \theta_1$. Show that the eigenvalues of A and H interlace in the sense
that $\alpha_{j+(n-m)} \le \theta_j \le \alpha_j$ (or equivalently $\alpha_j \le \theta_{j-(n-m)} \le \alpha_{j-(n-m)}$).


QUESTION 5.5. (Medium) Let $A = A^T$ with eigenvalues $\alpha_1 \ge \cdots \ge \alpha_n$. Let
$H = H^T$ with eigenvalues $\theta_1 \ge \cdots \ge \theta_n$. Let A + H have eigenvalues
$\lambda_1 \ge \cdots \ge \lambda_n$. Use the Courant-Fischer minimax theorem (Theorem 5.2) to show
that $\alpha_j + \theta_n \le \lambda_j \le \alpha_j + \theta_1$. If H is positive definite, conclude that $\lambda_j > \alpha_j$.
In other words, adding a symmetric positive definite matrix H to another
symmetric matrix A can only increase its eigenvalues.

This result will be used in the proof of Theorem 7.1. 


QUESTION 5.6. (Medium) Let $A = [A_1, A_2]$ be n-by-n, where $A_1$ is n-by-m
and $A_2$ is n-by-(n − m). Let $\sigma_1 \ge \cdots \ge \sigma_n$ be the singular values of A and
$\tau_1 \ge \cdots \ge \tau_m$ be the singular values of $A_1$. Use the Cauchy interlace theorem
from Question 5.4 and part 4 of Theorem 3.3 to prove that $\sigma_j \ge \tau_j \ge \sigma_{j+n-m}$.


QUESTION 5.7. (Medium) Let q be a unit vector and d be any vector orthogonal
to q. Show that $\|(q + d)q^T - I\|_2 = \|q + d\|_2$. (This result is used in the
proof of Theorem 5.4.)


QUESTION 5.8. (Hard) Formulate and prove a theorem for singular vectors 
analogous to Theorem 5.4. 


QUESTION 5.9. (Hard) Prove bound (5.6) from Theorem 5.5. 
QUESTION 5.10. (Harder) Prove bound (5.7) from Theorem 5.5. 


QUESTION 5.11. (Easy) Suppose $\theta = \theta_1 + \theta_2$, where all three angles lie between
0 and $\pi/2$. Prove that $\frac{1}{2}\sin 2\theta \le \frac{1}{2}\sin 2\theta_1 + \frac{1}{2}\sin 2\theta_2$. This result is used in
the proof of Theorem 5.7.


QUESTION 5.12. (Hard) Prove Corollary 5.2. Hint: Use part 4 of Theorem 3.3. 


QUESTION 5.13. (Medium) Let A be a symmetric matrix. Consider running
shifted QR iteration (Algorithm 4.5) with a Rayleigh quotient shift ($\sigma_i = a_{nn}$)
at every iteration, yielding a sequence $\sigma_1, \sigma_2, \ldots$ of shifts. Also run Rayleigh
quotient iteration (Algorithm 5.1), starting with $x_0 = [0, \ldots, 0, 1]^T$, yielding
a sequence of Rayleigh quotients $\rho_1, \rho_2, \ldots$. Show that these sequences are
identical: $\sigma_i = \rho_i$ for all i. This justifies the claim in section 5.3.2 that shifted
QR iteration enjoys local cubic convergence.


QUESTION 5.14. (Easy) Prove Lemma 5.1. 


QUESTION 5.15. (Easy) Prove that if $t(n) = 2t(n/2) + cn^3 + O(n^2)$, then
$t(n) \approx \frac{4}{3} c n^3$. This justifies the complexity analysis of the divide-and-conquer
algorithm (Algorithm 5.2).
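
The growth rate of a recurrence of this form can be explored numerically before attempting a proof. The sketch below is only an illustration under stated assumptions: it drops the O(n^2) term, takes t(1) = 0, restricts n to powers of two, and uses the hypothetical function name `t`. It prints the ratio t(n)/n^3 for several n so the limiting constant can be observed.

```python
def t(n, c=1.0):
    """Cost recurrence t(n) = 2 t(n/2) + c n^3 for n a power of two, with t(1) = 0."""
    if n == 1:
        return 0.0
    return 2.0 * t(n // 2, c) + c * n ** 3

# Watch the ratio t(n) / n^3 settle down as n grows.
for k in (2, 4, 8, 12):
    n = 2 ** k
    print(n, t(n) / n ** 3)

ratio = t(2 ** 12) / (2 ** 12) ** 3
```

Each level of the recursion contributes one quarter of the cubic cost of the level above it, so the ratio is a partial sum of a geometric series.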


QUESTION 5.16. (Easy) Let $A = D + \rho u u^T$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$ and
$u = [u_1, \ldots, u_n]^T$. Show that if $d_i = d_{i+1}$ or $u_i = 0$, then $d_i$ is an eigenvalue
of A. If $u_i = 0$, show that the eigenvector corresponding to $d_i$ is $e_i$, the
ith column of the identity matrix. Derive a similarly simple expression when
$d_i = d_{i+1}$. This shows how to handle deflation in the divide-and-conquer
algorithm, Algorithm 5.2.


QUESTION 5.17. (Easy) Let $\psi$ and $\psi'$ be given scalars. Show how to compute
scalars c and $\hat{c}$ in the function definition $h(\lambda) = \hat{c} + \frac{c}{d - \lambda}$ so that at $\lambda = \xi$,
$h(\xi) = \psi$ and $h'(\xi) = \psi'$. This result is needed to derive the secular equation
solver in section 5.3.3.


QUESTION 5.18. (Easy; Z. Bai) Use the SVD to show that if A is an m-by-n
real matrix with $m \ge n$, then there exists an m-by-n matrix Q with
orthonormal columns ($Q^T Q = I$) and an n-by-n positive semidefinite matrix P
such that A = QP. This decomposition is called the polar decomposition of A,
because it is analogous to the polar form of a complex number $z = e^{i \arg(z)} \cdot |z|$.
Show that if A is nonsingular, then the polar decomposition is unique.


QUESTION 5.19. (Easy) Prove Lemma 5.5.
QUESTION 5.20. (Easy) Prove Lemma 5.7. 


QUESTION 5.21. (Hard) Prove Theorem 5.13. Also, reduce the exponent 4n − 2
in Theorem 5.13 to 2n − 1. Hint: In Lemma 5.7, multiply $D_1$ and divide $D_2$
by an appropriately chosen constant.


QUESTION 5.22. (Medium) Prove that Algorithm 5.13 computes the SVD of
G, assuming that $G^T G$ converges to a diagonal matrix.


QUESTION 5.23. (Harder) Let A be an n-by-n symmetric positive definite
matrix with Cholesky decomposition $A = LL^T$, and let $\hat{L}$ be the Cholesky factor
computed in floating point arithmetic. In this question we will bound
the relative error in the (squared) singular values of $\hat{L}$ as approximations
of the eigenvalues of A. Show that A can be written $A = D\bar{A}D$, where
$D = \mathrm{diag}(a_{11}^{1/2}, \ldots, a_{nn}^{1/2})$ and $\bar{a}_{ii} = 1$ for all i. Write L = DX. Show that
$\kappa^2(X) = \kappa(\bar{A})$. Using bound (2.16) for the backward error $\delta A$ of Cholesky
$A + \delta A = \hat{L}\hat{L}^T$, show that one can write $\hat{L}^T\hat{L} = Y^T L^T L Y$, where
$\|Y^T Y - I\|_2 \le O(\varepsilon)\kappa(\bar{A})$. Use Theorem 5.6 to conclude that the eigenvalues of $\hat{L}^T\hat{L}$ and
of $L^T L$ differ relatively by at most $O(\varepsilon)\kappa(\bar{A})$. Then show that this is also
true of the eigenvalues of $\hat{L}\hat{L}^T$ and $LL^T$. This means that the squares of
the singular values of $\hat{L}$ differ relatively from the eigenvalues of A by at most
$O(\varepsilon)\kappa(\bar{A}) = O(\varepsilon)\kappa^2(L)$.


QUESTION 5.24. (Harder) This question justifies the stopping criterion for
one-sided Jacobi’s method for the SVD (Algorithm 5.13). Let $A = G^T G$,
where G and A are n-by-n. Suppose that $|a_{jk}| \le \varepsilon\sqrt{a_{jj}a_{kk}}$ for all $j \ne k$. Let
$\sigma_n \le \cdots \le \sigma_1$ be the singular values of G, and $\alpha_n^2 \le \cdots \le \alpha_1^2$ be the sorted
diagonal entries of A. Prove that $|\sigma_i - \alpha_i| \le n\varepsilon|\alpha_i|$ so that the $\alpha_i$ equal the
singular values to high relative accuracy. Hint: Use Corollary 5.2.


QUESTION 5.25. (Harder) In Question 4.15, you “noticed” that running QR
for m steps on a symmetric matrix, “flipping” the rows and columns, running
for another m steps, and flipping again got you back to the original matrix.
(Flipping X means replacing X by JXJ, where J is the identity matrix with
its rows in reverse order.) In this exercise we will prove this for symmetric
positive definite matrices T using an approach different from Corollary 5.4.

Consider LR iteration (Algorithm 5.9) with a zero shift, applied to the
symmetric positive definite matrix T (which is not necessarily tridiagonal):
Let $T = T_0 = B_0^T B_0$ be the Cholesky decomposition, $T_1 = B_0 B_0^T = B_1^T B_1$,
and more generally $T_i = B_{i-1} B_{i-1}^T = B_i^T B_i$. Let $\hat{T}_i$ denote the matrix obtained
from $T_0$ after i steps of unshifted QR iteration; i.e., if $\hat{T}_i = Q_i R_i$ is the QR
decomposition, then $\hat{T}_{i+1} = R_i Q_i$. In Lemma 5.6 we showed that $\hat{T}_i = T_{2i}$;
i.e., one step of QR is the same as two steps of LR.


1. Show that $T_i = (B_{i-1}B_{i-2} \cdots B_0)^{-T} T_0 (B_{i-1}B_{i-2} \cdots B_0)^T$.


2. Show that $T_i = (B_{i-1}B_{i-2} \cdots B_0) T_0 (B_{i-1}B_{i-2} \cdots B_0)^{-1}$.


3. Show that $T_0^i = (B_i B_{i-1} \cdots B_0)^T (B_i B_{i-1} \cdots B_0)$ is the Cholesky
decomposition of $T_0^i$.


4. Show that $T_0^i = (Q_0 \cdots Q_{i-2} Q_{i-1}) \cdot (R_{i-1} R_{i-2} \cdots R_0)$ is the QR
decomposition of $T_0^i$.


5. Show that $T_0^{2i} = (R_{2i-1} R_{2i-2} \cdots R_0)^T (R_{2i-1} R_{2i-2} \cdots R_0)$ is the
Cholesky decomposition of $T_0^{2i}$.


6. Show that the result after m steps of QR, flipping, m steps of QR, and flipping
is the same as the original matrix. Hint: Use the fact that the Cholesky
factorization is unique.


QUESTION 5.26. (Hard; Z. Bai) Suppose that x is an n-vector. Define the
matrix C by $c_{ij} = |x_i| + |x_j| - |x_i - x_j|$. Show that C(x) is positive semidefinite.


QUESTION 5.27. (Easy; Z. Bai) Let

$$ A = \begin{bmatrix} I & B \\ B^* & I \end{bmatrix} $$

with $\|B\|_2 < 1$. Show that

$$ \|A\|_2 \|A^{-1}\|_2 = \frac{1 + \|B\|_2}{1 - \|B\|_2}. $$


QUESTION 5.28. (Medium; Z. Bai) A square matrix A is said to be skew
Hermitian if $A^* = -A$. Prove that


1. the eigenvalues of a skew Hermitian matrix are purely imaginary.


2. I − A is nonsingular.


3. $C = (I - A)^{-1}(I + A)$ is unitary. C is called the Cayley transform of A.


Iterative Methods for Linear Systems


6.1. Introduction 


Iterative algorithms for solving Ax = b are used when methods such as Gaussian
elimination require too much time or too much space. Methods such as
Gaussian elimination, which compute the exact answers after a finite number
of steps (in the absence of roundoff!), are called direct methods. In contrast to 
direct methods, iterative methods generally do not produce the exact answer 
after a finite number of steps but decrease the error by some fraction after 
each step. Iteration ceases when the error is less than a user-supplied thresh- 
old. The final error depends on how many iterations one does as well as on 
properties of the method and the linear system. Our overall goal is to develop 
methods which decrease the error by a large amount at each iteration and do 
as little work per iteration as possible. 

Much of the activity in this field involves exploiting the underlying math- 
ematical or physical problem that gives rise to the linear system in order to 
design better iterative methods. The underlying problems are often finite 
difference or finite element models of physical systems, usually involving a 
differential equation. There are many kinds of physical systems, differential 
equations, and finite difference and finite element models, and so many meth- 
ods. We cannot hope to cover all or even most interesting situations, so we 
will limit ourselves to a model problem, the standard finite difference approx- 
imation to Poisson’s equation on a square. Poisson’s equation and its close 
relation, Laplace’s equation, arise in many applications, including electromag- 
netics, fluid mechanics, heat flow, diffusion, and quantum mechanics, to name 
a few. In addition to describing how each method works on Poisson’s equation, 
we will indicate how generally applicable it is, and describe common variations. 

The rest of this chapter is organized as follows. Section 6.2 describes on-line 
help and software for iterative methods discussed in this chapter. Section 6.3 
describes the formulation of the model problem in detail. Section 6.4 summa- 
rizes and compares the performance of (nearly) all the iterative methods in 
this chapter for solving the model problem. 

The next five sections describe methods in roughly increasing order of their
effectiveness on the model problem. Section 6.5 describes the most basic it- 
erative methods: Jacobi, Gauss-Seidel, successive overrelaxation, and their 
variations. Section 6.6 describes Krylov subspace methods, concentrating on 
the conjugate gradient method. Section 6.7 describes the fast Fourier trans- 
form and how to use it to solve the model problem. Section 6.8 describes block 
cyclic reduction. Finally, section 6.9 discusses multigrid, our fastest algorithm 
for the model problem. Multigrid requires only O(1) work per unknown, which 
is optimal. 

Section 6.10 describes domain decomposition, a family of techniques for 
combining the simpler methods described in earlier sections to solve more com- 
plicated problems than the model problem. 


6.2. On-line Help for Iterative Methods 


For Poisson’s equation, there will be a short list of numerical methods that 
are clearly superior to all the others we discuss. But for other linear systems 
it is not always clear which method is best (which is why we talk about so 
many!). To help users select the best method for solving their linear systems 
among the many available, on-line help is available at NETLIB/templates. 
This directory contains a short book [24] and software for most of the it- 
erative methods discussed in this chapter. The book is available in both 
PostScript (NETLIB/templates/templates.ps) and Hypertext Markup Lan- 
guage (NETLIB/templates/template.html). The software is available in Mat- 
lab, Fortran, and C++. 


The word template is used to describe this book and the software, because 
the implementations separate the details of matrix representations from the 
algorithm itself. In particular, the Krylov subspace methods (see section 6.6) 
require only the ability to multiply the matrix A by an arbitrary vector z. The 
best way to do this depends on how A is represented but does not otherwise 
affect the organization of the algorithm. In other words, matrix-vector multi- 
plication is a “black-box” called by the template. It is the user’s responsibility 
to supply an implementation of this black-box. 
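
The black-box idea can be sketched as follows. This is not the Templates software and does not reproduce its interfaces: the function names `conjugate_gradient` and `tridiag_matvec`, the stopping rule, and the test problem are assumptions made for illustration. The solver is a textbook conjugate gradient iteration (see section 6.6) that touches the matrix only through a user-supplied function computing the product of the matrix with a vector, so the matrix itself is never stored.

```python
import math

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = b[:]                      # residual b - A x for the starting guess x = 0
    p = r[:]                      # first search direction
    rr = sum(ri * ri for ri in r)
    b_norm = math.sqrt(sum(bi * bi for bi in b))
    for _ in range(max_iter):
        if math.sqrt(rr) <= tol * b_norm:
            break
        Ap = matvec(p)            # the only place the matrix is used
        alpha = rr / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = sum(ri * ri for ri in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def tridiag_matvec(v):
    """Black box for the tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

b = [1.0] * 10
x = conjugate_gradient(tridiag_matvec, b)
residual = max(abs(bi - yi) for bi, yi in zip(b, tridiag_matvec(x)))
print("max residual:", residual)
```

Swapping in a different matrix representation (dense, banded, or a stencil applied on a grid) requires changing only the function passed as `matvec`.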

An analogous templates project for eigenvalue problems is underway. Other 
recent textbooks on iterative methods are [15, 134, 212].

For practical problems arising from differential equations
more challenging than our model problem, the linear system Ax = b must
be “preconditioned,” or replaced with the equivalent system $M^{-1}Ax = M^{-1}b$,
which is somehow easier to solve. This is discussed at length in sections 6.6.5
and 6.10. Implementations, including parallel ones, of many of these techniques
are available on-line in the package PETSc, or Portable Extensible Toolkit for
Scientific computing, at http://www.mcs.anl.gov/petsc/petsc.html [230].


6.3. Poisson’s Equation


6.3.1. Poisson’s Equation in One Dimension 


We begin with a one-dimensional version of Poisson’s equation,


$$ -\frac{d^2 v(x)}{dx^2} = f(x), \qquad 0 < x < 1, \qquad (6.1) $$


where f(x) is a given function and v(x) is the unknown function that we want
to compute. v(x) must also satisfy the boundary conditions^{23} v(0) = v(1) = 0.
We discretize the problem by trying to compute an approximate solution at
N + 2 evenly spaced points $x_i$ between 0 and 1: $x_i = ih$, where $h = \frac{1}{N+1}$
and $0 \le i \le N + 1$. We abbreviate $v_i = v(x_i)$ and $f_i = f(x_i)$. To convert
differential equation (6.1) into a linear equation for the unknowns $v_1, \ldots, v_N$,
we use finite differences to approximate

$$ \left.\frac{dv(x)}{dx}\right|_{x=(i-.5)h} \approx \frac{v_i - v_{i-1}}{h}, $$
$$ \left.\frac{dv(x)}{dx}\right|_{x=(i+.5)h} \approx \frac{v_{i+1} - v_i}{h}. $$


Subtracting these approximations and dividing by h yield the centered
difference approximation

$$ -\left.\frac{d^2 v(x)}{dx^2}\right|_{x=x_i} = \frac{2v_i - v_{i-1} - v_{i+1}}{h^2} - \tau_i, \qquad (6.2) $$


where $\tau_i$, the so-called truncation error, can be shown to be $O\left(h^2 \cdot \left\|\frac{d^4 v}{dx^4}\right\|_\infty\right)$.
We may now rewrite equation (6.1) at $x = x_i$ as


$$ -v_{i-1} + 2v_i - v_{i+1} = h^2 f_i + h^2 \tau_i, $$


where 0 < i < N + 1. Since the boundary conditions imply that $v_0 = v_{N+1} = 0$,
we have N equations in N unknowns $v_1, \ldots, v_N$:


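$$ T_N \cdot \begin{bmatrix} v_1 \\ \vdots \\ \vdots \\ v_N \end{bmatrix}
\equiv \begin{bmatrix} 2 & -1 & & \\ -1 & \ddots & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{bmatrix}
\begin{bmatrix} v_1 \\ \vdots \\ \vdots \\ v_N \end{bmatrix}
= h^2 \begin{bmatrix} f_1 \\ \vdots \\ \vdots \\ f_N \end{bmatrix}
+ h^2 \begin{bmatrix} \tau_1 \\ \vdots \\ \vdots \\ \tau_N \end{bmatrix} \qquad (6.3) $$


^{23} These are called Dirichlet boundary conditions. Other kinds of boundary conditions are
also possible.
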

Fig. 6.1. Eigenvalues of $T_{21}$.


or


$$ T_N v = h^2 f + h^2 \bar{\tau}. \qquad (6.4) $$


To solve this equation, we will ignore $\bar{\tau}$, since it is small compared to f, to
get

$$ T_N \hat{v} = h^2 f. \qquad (6.5) $$


(We bound the error $v - \hat{v}$ later.)

The coefficient matrix Ty plays a central role in all that follows, so we will 
examine it in some detail. First, we will compute its eigenvalues and eigen- 
vectors. One can easily use trigonometric identities to confirm the following 
lemma (see Question 6.1). 


LEMMA 6.1. The eigenvalues of $T_N$ are $\lambda_j = 2\left(1 - \cos\frac{\pi j}{N+1}\right)$. The eigenvectors
are $z_j$, where $z_j(k) = \sqrt{\frac{2}{N+1}} \sin(jk\pi/(N+1))$. $z_j$ has unit two-norm. Let
$Z = [z_1, \ldots, z_N]$ be the orthogonal matrix whose columns are the eigenvectors,
and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$, so we can write $T_N = Z \Lambda Z^T$.
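
The lemma is easy to check numerically. The following sketch, with the hypothetical helper names `eigenpair` and `apply_TN`, forms each claimed eigenpair for N = 21 and measures how far $T_N z_j$ is from $\lambda_j z_j$ and how far $z_j$ is from having unit two-norm. It is an illustration only and does not replace the trigonometric proof requested in Question 6.1.

```python
import math

def eigenpair(N, j):
    """Eigenvalue lambda_j and eigenvector z_j of T_N as stated in Lemma 6.1 (j = 1..N)."""
    lam = 2.0 * (1.0 - math.cos(math.pi * j / (N + 1)))
    scale = math.sqrt(2.0 / (N + 1))
    z = [scale * math.sin(j * k * math.pi / (N + 1)) for k in range(1, N + 1)]
    return lam, z

def apply_TN(v):
    """Multiply by T_N: 2 on the diagonal, -1 on the sub- and superdiagonal."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

N = 21
worst_residual = 0.0      # largest entry of T_N z_j - lambda_j z_j over all j
worst_norm_error = 0.0    # largest deviation of ||z_j||_2^2 from 1 over all j
for j in range(1, N + 1):
    lam, z = eigenpair(N, j)
    Tz = apply_TN(z)
    worst_residual = max(worst_residual,
                         max(abs(Tz[k] - lam * z[k]) for k in range(N)))
    worst_norm_error = max(worst_norm_error, abs(sum(x * x for x in z) - 1.0))
print(worst_residual, worst_norm_error)
```

Both printed quantities should be at the level of rounding error.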


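Figure 6.1 is a plot of the eigenvalues of $T_N$ for N = 21.
The largest eigenvalue is $\lambda_N = 2\left(1 - \cos\frac{\pi N}{N+1}\right) \approx 4$. The smallest
eigenvalue^{24} is $\lambda_1$, where for small i


$$ \lambda_i = 2\left(1 - \cos\frac{i\pi}{N+1}\right) \approx 2\left(1 - \left(1 - \frac{i^2\pi^2}{2(N+1)^2}\right)\right) = \left(\frac{i\pi}{N+1}\right)^2. $$


^{24} Note that $\lambda_N$ is the largest eigenvalue and $\lambda_1$ is the smallest eigenvalue, the opposite of
the convention of Chapter 5.
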

Fig. 6.2. Eigenvectors of $T_{21}$.


Thus $T_N$ is positive definite with condition number $\lambda_N/\lambda_1 \approx 4(N+1)^2/\pi^2$ for
large N. The eigenvectors are sinusoids with lowest frequency at j = 1 and
highest at j = N, shown in Figure 6.2 for N = 21.
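
To see how good the large-N estimate of the condition number is, one can compare it with the exact ratio of the extreme eigenvalues from Lemma 6.1. The sketch below does this for a few values of N; the function names `condition_number` and `estimate` are illustrative assumptions.

```python
import math

def condition_number(N):
    """Exact ratio lambda_N / lambda_1 using the eigenvalue formula of Lemma 6.1."""
    lam_min = 2.0 * (1.0 - math.cos(math.pi / (N + 1)))
    lam_max = 2.0 * (1.0 - math.cos(math.pi * N / (N + 1)))
    return lam_max / lam_min

def estimate(N):
    """Large-N approximation 4 (N + 1)^2 / pi^2."""
    return 4.0 * (N + 1) ** 2 / math.pi ** 2

for N in (21, 100, 1000):
    print(N, condition_number(N), estimate(N))
```

The relative gap between the two columns shrinks as N grows, while the condition number itself grows like $(N+1)^2$, which is why finer grids give harder linear systems.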

Now we know enough to bound the error, i.e., the difference between the
solution of $T_N \hat{v} = h^2 f$ and the true solution v of the differential equation:
Subtract equation (6.5) from equation (6.4) to get $v - \hat{v} = h^2 T_N^{-1} \bar{\tau}$. Taking
norms yields


$$ \|v - \hat{v}\|_2 \le h^2 \|T_N^{-1}\|_2 \|\bar{\tau}\|_2 \approx h^2 \frac{(N+1)^2}{\pi^2} \|\bar{\tau}\|_2 = O(\|\bar{\tau}\|_2) = O\left(h^2 \left\|\frac{d^4 v}{dx^4}\right\|_\infty\right), $$


so the error $v - \hat{v}$ goes to zero proportionally to $h^2$, provided that the solution
is smooth enough (i.e., $\left\|\frac{d^4 v}{dx^4}\right\|_\infty$ is bounded).


From now on we will not distinguish between v and its approximation $\hat{v}$,
and so will simplify notation by letting $T_N v = h^2 f$.
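
The $O(h^2)$ behavior of the error can be observed on a problem with a known solution. The sketch below is an illustration under assumptions not taken from the text: it picks $f(x) = \pi^2 \sin(\pi x)$, so that $v(x) = \sin(\pi x)$ satisfies (6.1) and the boundary conditions, and it solves the tridiagonal system by elimination without pivoting. The names `solve_poisson_1d` and `max_error` are hypothetical.

```python
import math

def solve_poisson_1d(f, N):
    """Solve T_N v = h^2 f by tridiagonal elimination (T_N is positive definite)."""
    h = 1.0 / (N + 1)
    rhs = [h * h * f(i * h) for i in range(1, N + 1)]
    diag = [2.0] * N
    # Forward elimination of the subdiagonal entries (all equal to -1).
    for i in range(1, N):
        w = -1.0 / diag[i - 1]
        diag[i] -= w * (-1.0)
        rhs[i] -= w * rhs[i - 1]
    # Back substitution using the superdiagonal entries (all equal to -1).
    v = [0.0] * N
    v[N - 1] = rhs[N - 1] / diag[N - 1]
    for i in range(N - 2, -1, -1):
        v[i] = (rhs[i] + v[i + 1]) / diag[i]
    return v

def max_error(N):
    """Largest pointwise error against the exact solution v(x) = sin(pi x)."""
    def f(x):
        return math.pi ** 2 * math.sin(math.pi * x)
    v = solve_poisson_1d(f, N)
    h = 1.0 / (N + 1)
    return max(abs(v[i] - math.sin(math.pi * (i + 1) * h)) for i in range(N))

e1 = max_error(19)   # h = 1/20
e2 = max_error(39)   # h = 1/40
print(e1, e2, e1 / e2)
```

Halving h should reduce the error by a factor close to four, consistent with the bound above.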

In addition to the solution of the linear system $h^{-2} T_N v = f$ approximating
the solution of the differential equation (6.1), it turns out that the eigenvalues
and eigenvectors of $h^{-2} T_N$ also approximate the eigenvalues and eigenfunctions
of the differential equation: We say that $\hat{\lambda}_i$ is an eigenvalue and $\hat{z}_i(x)$ is an
eigenfunction of the differential equation if

$$ -\frac{d^2 \hat{z}_i(x)}{dx^2} = \hat{\lambda}_i \hat{z}_i(x) \quad \text{with} \quad \hat{z}_i(0) = \hat{z}_i(1) = 0. $$

Let us solve for $\hat{\lambda}_i$ and $\hat{z}_i(x)$: It is easy to see that $\hat{z}_i(x)$ must equal
$\alpha \sin(\sqrt{\hat{\lambda}_i}\, x) + \beta \cos(\sqrt{\hat{\lambda}_i}\, x)$ for some constants $\alpha$ and $\beta$. The boundary condition $\hat{z}_i(0) = 0$
implies $\beta = 0$, and the boundary condition $\hat{z}_i(1) = 0$ implies that $\sqrt{\hat{\lambda}_i}$ is
an integer multiple of $\pi$, which we can take to be $i\pi$. Thus $\hat{\lambda}_i = i^2\pi^2$ and
$\hat{z}_i(x) = \alpha \sin(i\pi x)$ for any nonzero constant $\alpha$ (which we can set to 1). Thus
the eigenvector $z_i$ is precisely equal to the eigenfunction $\hat{z}_i(x)$ evaluated at the
sample points $x_j = jh$ (when scaled by $\sqrt{\frac{2}{N+1}}$). And when i is small, $\hat{\lambda}_i = i^2\pi^2$
is well approximated by $h^{-2} \cdot \lambda_i = (N+1)^2 \cdot 2\left(1 - \cos\frac{i\pi}{N+1}\right) = i^2\pi^2 + O((N+1)^{-2})$.
Thus we see there is a close correspondence between $T_N$ (or $h^{-2} T_N$) and the
second derivative operator $-\frac{d^2}{dx^2}$. This correspondence will be the motivation
for the design and analysis of later algorithms.

It is also possible to write down simple formulas for the Cholesky and LU
factors of $T_N$; see Question 6.2 for details.


6.3.2. Poisson’s Equation in Two Dimensions 


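Now we turn to Poisson’s equation in two dimensions:


$$ -\frac{\partial^2 v(x,y)}{\partial x^2} - \frac{\partial^2 v(x,y)}{\partial y^2} = f(x,y) \qquad (6.6) $$


on the unit square {(x, y) : 0 < x, y < 1}, with boundary condition v = 0
on the boundary of the square. We discretize at the grid points in the square
which are at $(x_i, y_j)$ with $x_i = ih$ and $y_j = jh$, with $h = \frac{1}{N+1}$. We abbreviate
$v_{ij} = v(ih, jh)$ and $f_{ij} = f(ih, jh)$, as shown below for N = 3:


[Figure: the grid of points for N = 3, with labels i = 0, 1, 2, 3, 4.]


From equation (6.2), we know that we can approximate


$$ -\left.\frac{\partial^2 v(x,y)}{\partial x^2}\right|_{x=x_i,\, y=y_j} \approx \frac{2v_{i,j} - v_{i-1,j} - v_{i+1,j}}{h^2} \quad \text{and} \qquad (6.7) $$
$$ -\left.\frac{\partial^2 v(x,y)}{\partial y^2}\right|_{x=x_i,\, y=y_j} \approx \frac{2v_{i,j} - v_{i,j-1} - v_{i,j+1}}{h^2}. \qquad (6.8) $$
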

Adding these approximations lets us write


$$ \left.\left( -\frac{\partial^2 v(x,y)}{\partial x^2} - \frac{\partial^2 v(x,y)}{\partial y^2} \right)\right|_{x=x_i,\, y=y_j}
= \frac{4v_{ij} - v_{i-1,j} - v_{i+1,j} - v_{i,j-1} - v_{i,j+1}}{h^2} - \tau_{ij}, \qquad (6.9) $$


where $\tau_{ij}$ is again a truncation error bounded by $O(h^2)$. The heavy (blue)
cross in the middle of the above figure is called the (5-point) stencil of this
equation, because it connects all (5) values of v present in equation (6.9).
From the boundary conditions we know $v_{0j} = v_{N+1,j} = v_{i0} = v_{i,N+1} = 0$ so
that equation (6.9) defines a set of $n = N^2$ linear equations in the n unknowns
$v_{ij}$ for $1 \le i, j \le N$:


$$ 4v_{ij} - v_{i-1,j} - v_{i+1,j} - v_{i,j-1} - v_{i,j+1} = h^2 f_{ij}. \qquad (6.10) $$


There are two ways to rewrite the n equations represented by (6.10) as a 
single matrix equation, both of which we will use later. 

The first way is to think of the unknowns $v_{ij}$ as occupying an N-by-N
matrix V with entries $v_{ij}$ and the right-hand sides $h^2 f_{ij}$ as similarly occupying
an N-by-N matrix $h^2 F$. The trick is to write the matrix with i, j entry
$4v_{ij} - v_{i-1,j} - v_{i+1,j} - v_{i,j-1} - v_{i,j+1}$ in a simple way in terms of V and $T_N$: Simply
note that


$$ 2v_{ij} - v_{i-1,j} - v_{i+1,j} = (T_N \cdot V)_{ij}, $$

$$ 2v_{ij} - v_{i,j-1} - v_{i,j+1} = (V \cdot T_N)_{ij}, $$


so adding these two equations yields


$$ (T_N \cdot V + V \cdot T_N)_{ij} = 4v_{ij} - v_{i-1,j} - v_{i+1,j} - v_{i,j-1} - v_{i,j+1} = h^2 f_{ij} = (h^2 F)_{ij} $$


or

$$ T_N \cdot V + V \cdot T_N = h^2 F. \qquad (6.11) $$

This is a linear system of equations for the unknown entries of the matrix V,
even though it is not written in the usual “Ax = b” format, with the unknowns
forming a vector x. (We will write the “Ax = b” format below.) Still, it
is enough to tell us what the eigenvalues and eigenvectors of the underlying
matrix A are, because “$Ax = \lambda x$” is the same as “$T_N V + V T_N = \lambda V$.” Now
suppose that $T_N z_i = \lambda_i z_i$ and $T_N z_j = \lambda_j z_j$ are any two eigenpairs of $T_N$, and
let $V = z_i z_j^T$. Then


$$ T_N V + V T_N = (T_N z_i) z_j^T + z_i (z_j^T T_N) $$
$$ = (\lambda_i z_i) z_j^T + z_i (z_j^T \lambda_j) $$
$$ = (\lambda_i + \lambda_j) z_i z_j^T $$
$$ = (\lambda_i + \lambda_j) V, \qquad (6.12) $$


so $V = z_i z_j^T$ is an “eigenvector” and $\lambda_i + \lambda_j$ is an eigenvalue. Since V has
$N^2$ entries, we expect $N^2$ eigenvalues and eigenvectors, one for each pair of
eigenvalues $\lambda_i$ and $\lambda_j$ of $T_N$. In particular, the smallest eigenvalue is $2\lambda_1$ and
the largest eigenvalue is $2\lambda_N$, so the condition number is the same as in the
one-dimensional case. We rederive this result below using the “Ax = b” format.
See Figure 6.3 for plots of some eigenvectors, represented as surfaces defined
by the matrix entries of $z_i z_j^T$.
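
Both the matrix form of the 5-point stencil and the eigenvector calculation can be checked on a small grid. The following sketch uses hypothetical helper names (`tn_matrix`, `eigenpair`, `stencil`) and an arbitrary choice of N = 5, i = 2, j = 4. It forms $V = z_i z_j^T$, computes $T_N V + V T_N$, and compares it entry by entry with the stencil applied to V (with zero values outside the grid) and with $(\lambda_i + \lambda_j) V$.

```python
import math

def tn_matrix(N):
    """The N-by-N matrix T_N: 2 on the diagonal, -1 next to it."""
    return [[2.0 if r == c else (-1.0 if abs(r - c) == 1 else 0.0)
             for c in range(N)] for r in range(N)]

def matmul(A, B):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def eigenpair(N, j):
    """Eigenvalue and unit eigenvector of T_N from Lemma 6.1 (j = 1..N)."""
    lam = 2.0 * (1.0 - math.cos(math.pi * j / (N + 1)))
    s = math.sqrt(2.0 / (N + 1))
    return lam, [s * math.sin(j * k * math.pi / (N + 1)) for k in range(1, N + 1)]

def stencil(V, r, c):
    """5-point stencil at entry (r, c), with zero values outside the grid."""
    n = len(V)
    def get(a, b):
        return V[a][b] if 0 <= a < n and 0 <= b < n else 0.0
    return (4.0 * V[r][c] - get(r - 1, c) - get(r + 1, c)
            - get(r, c - 1) - get(r, c + 1))

N = 5
T = tn_matrix(N)
lam_i, z_i = eigenpair(N, 2)
lam_j, z_j = eigenpair(N, 4)
V = [[z_i[r] * z_j[c] for c in range(N)] for r in range(N)]        # V = z_i z_j^T
TV, VT = matmul(T, V), matmul(V, T)
lhs = [[TV[r][c] + VT[r][c] for c in range(N)] for r in range(N)]  # T_N V + V T_N

stencil_gap = max(abs(lhs[r][c] - stencil(V, r, c))
                  for r in range(N) for c in range(N))
eigen_gap = max(abs(lhs[r][c] - (lam_i + lam_j) * V[r][c])
                for r in range(N) for c in range(N))
print(stencil_gap, eigen_gap)
```

Both gaps should be at the level of rounding error.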


Just as the eigenvalues and eigenvectors of $h^{-2} T_N$ were good approximations
to the eigenvalues and eigenfunctions of one-dimensional Poisson’s equation,
the same is true of two-dimensional Poisson’s equation, whose eigenvalues
and eigenfunctions are as follows (see Question 6.3):


$$ \left( -\frac{\partial^2}{\partial x^2} - \frac{\partial^2}{\partial y^2} \right) \sin(i\pi x)\sin(j\pi y) = (i^2\pi^2 + j^2\pi^2) \sin(i\pi x)\sin(j\pi y). \qquad (6.13) $$

The second way to write the n equations represented by equation (6.10)
as a single matrix equation is to write the unknowns $v_{ij}$ in a single long
$N^2$-by-1 vector. This requires us to choose an order for them, and we (somewhat
arbitrarily) choose to number them as shown in Figure 6.4, columnwise from
the upper left to the lower right.


For example, when N = 3 one gets a column vector $v = [v_1, \ldots, v_9]^T$. If
we number f accordingly, we can transform equation (6.10) to get

$$ T_{3\times 3} \cdot v \equiv
\begin{bmatrix}
 4 & -1 &  0 & -1 &  0 &  0 &  0 &  0 &  0 \\
-1 &  4 & -1 &  0 & -1 &  0 &  0 &  0 &  0 \\
 0 & -1 &  4 &  0 &  0 & -1 &  0 &  0 &  0 \\
-1 &  0 &  0 &  4 & -1 &  0 & -1 &  0 &  0 \\
 0 & -1 &  0 & -1 &  4 & -1 &  0 & -1 &  0 \\
 0 &  0 & -1 &  0 & -1 &  4 &  0 &  0 & -1 \\
 0 &  0 &  0 & -1 &  0 &  0 &  4 & -1 &  0 \\
 0 &  0 &  0 &  0 & -1 &  0 & -1 &  4 & -1 \\
 0 &  0 &  0 &  0 &  0 & -1 &  0 & -1 &  4
\end{bmatrix}
\begin{bmatrix} v_1 \\ \vdots \\ v_9 \end{bmatrix}
= h^2 \begin{bmatrix} f_1 \\ \vdots \\ f_9 \end{bmatrix}. \qquad (6.14) $$


The −1’s immediately next to the diagonal correspond to subtracting the
top and bottom neighbors $-v_{i-1,j} - v_{i+1,j}$. The −1’s farther away from
the diagonal correspond to subtracting the left and right neighbors
$-v_{i,j-1} - v_{i,j+1}$. For general N, we confirm in the next section that we get an $N^2$-by-$N^2$
linear system

$$ T_{N\times N} \cdot v = h^2 f, \qquad (6.15) $$


Fig. 6.3. Three-dimensional and contour plots of first four eigenvectors of the 10-by-10
Poisson equation.

Fig. 6.4. Numbering the unknowns in Poisson’s equation.


where $T_{N\times N}$ has N N-by-N blocks of the form $T_N + 2I_N$ on its diagonal and
$-I_N$ blocks on its offdiagonals:

$$ T_{N\times N} = \begin{bmatrix} T_N + 2I_N & -I_N & & \\ -I_N & \ddots & \ddots & \\ & \ddots & \ddots & -I_N \\ & & -I_N & T_N + 2I_N \end{bmatrix}. \qquad (6.16) $$


6.3.3. Expressing Poisson’s Equation with Kronecker Products


Here is a systematic way to derive equations (6.15) and (6.16) as well as to
compute the eigenvalues and eigenvectors of $T_{N\times N}$. The method works equally
well for Poisson’s equation in three or more dimensions.


DEFINITION 6.1. Let X be m-by-n. Then vec(X) is defined to be a column 
vector of size m-n made of the columns of X stacked atop one another from 
left to right. 


Note that the $N^2$-by-1 vector v defined in Figure 6.4 can also be written
v = vec(V).

To express Tyxy as well as compute its eigenvalues and eigenvectors, we 
need to introduce Kronecker products. 


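DEFINITION 6.2. Let A be an m-by-n matrix and B be a p-by-q matrix. Then
$A \otimes B$, the Kronecker product of A and B, is the (m · p)-by-(n · q) matrix


$$ \begin{bmatrix} a_{1,1} \cdot B & \cdots & a_{1,n} \cdot B \\ \vdots & & \vdots \\ a_{m,1} \cdot B & \cdots & a_{m,n} \cdot B \end{bmatrix}. $$
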

The following lemma tells us how to rewrite the Poisson equation in terms 
of Kronecker products and the vec(-) operator. 

LEMMA 6.2. Let A be m-by-m, B be n-by-n, and X and C be m-by-n. Then
the following properties hold:

1. $\mathrm{vec}(AX) = (I_n \otimes A) \cdot \mathrm{vec}(X)$.

2. $\mathrm{vec}(XB) = (B^T \otimes I_m) \cdot \mathrm{vec}(X)$.

3. The Poisson equation $T_N V + V T_N = h^2 F$ is equivalent to


$$ T_{N\times N} \cdot \mathrm{vec}(V) \equiv (I_N \otimes T_N + T_N \otimes I_N) \cdot \mathrm{vec}(V) = h^2 \mathrm{vec}(F). \qquad (6.17) $$


Proof. We prove only part 3, leaving the other parts to Question 6.4. We start
with the Poisson equation $T_N V + V T_N = h^2 F$ as expressed in equation (6.11),
which is clearly equivalent to


$$ \mathrm{vec}(T_N V + V T_N) = \mathrm{vec}(T_N V) + \mathrm{vec}(V T_N) = \mathrm{vec}(h^2 F). $$

By part 1 of the lemma

$$ \mathrm{vec}(T_N V) = (I_N \otimes T_N)\,\mathrm{vec}(V). $$

By part 2 of the lemma and the symmetry of $T_N$,


$$ \mathrm{vec}(V T_N) = (T_N^T \otimes I_N)\,\mathrm{vec}(V) = (T_N \otimes I_N)\,\mathrm{vec}(V). $$


Adding the last two expressions completes the proof of part 3.
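
Parts 1 and 2 of Lemma 6.2 can be checked directly on small matrices. The sketch below is illustrative only: the helpers `kron` and `vec` are written out by hand rather than taken from any library, and the matrices A, B, and X are arbitrary choices with m = 2 and n = 3.

```python
def matmul(A, B):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, x):
    """Product of a dense matrix with a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def vec(X):
    """Stack the columns of X on top of one another, left to right."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

A = [[1.0, 2.0], [3.0, 4.0]]                                  # m-by-m, m = 2
B = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [2.0, -1.0, 3.0]]      # n-by-n, n = 3
X = [[1.0, -2.0, 0.5], [4.0, 0.0, 3.0]]                       # m-by-n

part1_lhs = vec(matmul(A, X))
part1_rhs = matvec(kron(identity(3), A), vec(X))              # (I_n (x) A) vec(X)
part2_lhs = vec(matmul(X, B))
part2_rhs = matvec(kron(transpose(B), identity(2)), vec(X))   # (B^T (x) I_m) vec(X)

gap1 = max(abs(a - b) for a, b in zip(part1_lhs, part1_rhs))
gap2 = max(abs(a - b) for a, b in zip(part2_lhs, part2_rhs))
print(gap1, gap2)
```

A numerical check of this kind confirms the bookkeeping of indices but is not a substitute for the proof requested in Question 6.4.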
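The reader can confirm that the expression


$$ T_{N\times N} = I_N \otimes T_N + T_N \otimes I_N
= \begin{bmatrix} T_N & & \\ & \ddots & \\ & & T_N \end{bmatrix}
+ \begin{bmatrix} 2I_N & -I_N & & \\ -I_N & \ddots & \ddots & \\ & \ddots & \ddots & -I_N \\ & & -I_N & 2I_N \end{bmatrix} $$


from equation (6.17) agrees with equation (6.16).^{25}
To compute the eigenvalues of matrices defined by Kronecker products, like
$T_{N\times N}$, we need the following lemma, whose proof is also part of Question 6.4.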

LEMMA 6.3. The following facts about Kronecker products hold:


1. Assume that the products A · C and B · D are well defined. Then
$(A \otimes B) \cdot (C \otimes D) = (A \cdot C) \otimes (B \cdot D)$.


2. If A and B are invertible, then $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$.


3. $(A \otimes B)^T = A^T \otimes B^T$.


^{25} We can use this formula to compute $T_{N\times N}$ in two lines of Matlab:


TN = 2*eye(N) - diag(ones(N-1,1),1) - diag(ones(N-1,1),-1);
TNXN = kron(eye(N),TN) + kron(TN,eye(N));


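PROPOSITION 6.1. Let $T_N = Z \Lambda Z^T$ be the eigendecomposition of $T_N$, with
$Z = [z_1, \ldots, z_N]$ the orthogonal matrix whose columns are eigenvectors, and
$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$. Then the eigendecomposition of $T_{N\times N} = I \otimes T_N + T_N \otimes I$
is


$$ I \otimes T_N + T_N \otimes I = (Z \otimes Z) \cdot (I \otimes \Lambda + \Lambda \otimes I) \cdot (Z \otimes Z)^T. \qquad (6.18) $$


$I \otimes \Lambda + \Lambda \otimes I$ is a diagonal matrix whose (iN + j)th diagonal entry, the (i, j)th
eigenvalue of $T_{N\times N}$, is $\lambda_{i,j} = \lambda_i + \lambda_j$. $Z \otimes Z$ is an orthogonal matrix whose
(iN + j)th column, the corresponding eigenvector, is $z_i \otimes z_j$.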


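Proof. From parts 1 and 3 of Lemma 6.3, it is easy to verify that $Z \otimes Z$ is
orthogonal, since $(Z \otimes Z)(Z \otimes Z)^T = (Z \otimes Z)(Z^T \otimes Z^T) = (Z \cdot Z^T) \otimes (Z \cdot Z^T) = I \otimes I = I$.
We can now verify equation (6.18):


$$ (Z \otimes Z) \cdot (I \otimes \Lambda + \Lambda \otimes I) \cdot (Z \otimes Z)^T $$
$$ = (Z \otimes Z) \cdot (I \otimes \Lambda + \Lambda \otimes I) \cdot (Z^T \otimes Z^T) \quad \text{by part 3 of Lemma 6.3} $$
$$ = (Z \cdot I \cdot Z^T) \otimes (Z \cdot \Lambda \cdot Z^T) + (Z \cdot \Lambda \cdot Z^T) \otimes (Z \cdot I \cdot Z^T) \quad \text{by part 1 of Lemma 6.3} $$
$$ = (I) \otimes (T_N) + (T_N) \otimes (I) $$
$$ = T_{N\times N}. $$


Also, it is easy to verify that $I \otimes \Lambda + \Lambda \otimes I$ is diagonal, with diagonal entry
(iN + j) given by $\lambda_i + \lambda_j$, so that equation (6.18) really is the eigendecomposition
of $T_{N\times N}$. Finally, from the definition of Kronecker product, one can see that
column iN + j of $Z \otimes Z$ is $z_i \otimes z_j$.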

The reader can confirm that the eigenvector $z_i \otimes z_j = \mathrm{vec}(z_j z_i^T)$, thus
matching the expression for an eigenvector in equation (6.12).

For a generalization of Proposition 6.1 to the matrix $A \otimes I + B^T \otimes I$, which
arises when solving the Sylvester equation AX − XB = C, see Question 6.5
(and Question 4.6).
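
Proposition 6.1 can be confirmed numerically for a small grid. The sketch below is a Python analogue of the Kronecker product construction, with hand-written helpers (`tn_matrix`, `kron`, `eigenpair`) that are assumptions of this illustration. It builds $T_{N\times N}$ for N = 3, compares its first row with the first row of the matrix in equation (6.14), and checks that every $z_i \otimes z_j$ is an eigenvector with eigenvalue $\lambda_i + \lambda_j$.

```python
import math

def tn_matrix(N):
    """The N-by-N matrix T_N: 2 on the diagonal, -1 next to it."""
    return [[2.0 if r == c else (-1.0 if abs(r - c) == 1 else 0.0)
             for c in range(N)] for r in range(N)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def eigenpair(N, j):
    """Eigenvalue and unit eigenvector of T_N from Lemma 6.1 (j = 1..N)."""
    lam = 2.0 * (1.0 - math.cos(math.pi * j / (N + 1)))
    s = math.sqrt(2.0 / (N + 1))
    return lam, [s * math.sin(j * k * math.pi / (N + 1)) for k in range(1, N + 1)]

N = 3
T, I = tn_matrix(N), identity(N)
K1, K2 = kron(I, T), kron(T, I)
TNN = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(K1, K2)]  # I (x) T_N + T_N (x) I

worst = 0.0
for i in range(1, N + 1):
    lam_i, z_i = eigenpair(N, i)
    for j in range(1, N + 1):
        lam_j, z_j = eigenpair(N, j)
        w = [a * b for a in z_i for b in z_j]                      # z_i (x) z_j
        Tw = matvec(TNN, w)
        worst = max(worst, max(abs(Tw[k] - (lam_i + lam_j) * w[k])
                               for k in range(N * N)))
print(TNN[0])
print(worst)
```

The same construction extends to three dimensions by adding a third Kronecker factor, at the cost of an $N^3$-by-$N^3$ matrix.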

Similarly, Poisson’s equation in three dimensions leads to


$$ T_{N\times N\times N} = T_N \otimes I_N \otimes I_N + I_N \otimes T_N \otimes I_N + I_N \otimes I_N \otimes T_N, $$


with eigenvalues all possible triple sums of eigenvalues of $T_N$, and eigenvector
matrix $Z \otimes Z \otimes Z$. Poisson’s equation in higher dimensions is represented
analogously.




Method                       Serial Time   Space      Direct or   Section
                                                      Iterative
Dense Cholesky               n^3           n^2        D           2.7.1
Explicit inverse             n^2           n^2        D
Band Cholesky                n^2           n^{3/2}    D           2.7.3
Jacobi                       n^2           n          I           6.5
Gauss-Seidel                 n^2           n          I           6.5
Sparse Cholesky              n^{3/2}       n log n    D           2.7.4
Conjugate gradients          n^{3/2}       n          I           6.6
Successive overrelaxation    n^{3/2}       n          I           6.5
SSOR with Chebyshev accel.   n^{5/4}       n          I           6.5
Fast Fourier transform       n log n       n          D           6.7
Block cyclic reduction       n log n       n          D           6.8
Multigrid                    n             n          I           6.9
Lower bound                  n             n


Table 6.1. Order of complexity of solving Poisson’s equation on an N-by-N grid
($n = N^2$).


6.4. Summary of Methods for Solving Poisson’s Equation


Table 6.1 lists the costs of various direct and iterative methods for solving 
the model problem on an N-by-N grid. The variable n = N?, the number 
of unknowns. Since direct methods provide the exact answer (in the absence 
of roundoff), whereas iterative methods provide only approximate answers, we 
must be careful when comparing their costs, since a low-accuracy answer can be 
computed more cheaply by an iterative method than a high-accuracy answer. 
Therefore, we compare costs, assuming that the iterative methods iterate often
enough to make the error at most some fixed small value^{26} (say, $10^{-6}$).

The second and third columns of Table 6.1 give the number of arithmetic 
operations (or time) and space required on a serial machine. Column 4 indi- 
cates whether the method is direct (D) or iterative (I). All entries are meant in 
the O(·) sense; the constants depend on implementation details and the stopping
criterion for the iterative methods (say, $10^{-6}$). For example, the entry for
Cholesky also applies to Gaussian elimination, since this changes the constant 
only by a factor of two. The last column indicates where the algorithm is 
discussed in the text. 

The methods are listed in increasing order of speed, from slowest (dense 


^{26} Alternatively, we could iterate until the error is $O(h^2) = O((N + 1)^{-2})$, the size of the
truncation error. One can show that this would increase the costs of the iterative methods
in Table 6.1 by a factor of O(log n).


Cholesky) to fastest (multigrid), ending with a lower bound applying to any
method. The lower bound is n because at least one operation is required per 
solution component, since otherwise they could not all be different and also 
depend on the input. The methods are also, roughly speaking, in order of de- 
creasing generality, with dense Cholesky applicable to any symmetric positive 
definite matrix and later algorithms applicable (or at least provably conver- 
gent) only for limited classes of matrices. In later sections we will describe the 
applicability of various methods in more detail. 

The “explicit inverse” algorithm refers to precomputing the explicit inverse
of $T_{N \times N}$ and computing $v = T_{N \times N}^{-1} f$ by a single matrix-vector multiplication
(and not counting the flops to precompute $T_{N \times N}^{-1}$). Along with dense Cholesky,
it uses $n^2$ space, vastly more than the other methods. It is not a good method.
Band Cholesky was discussed in section 2.7.3; this is just Cholesky taking 
advantage of the fact that there are no entries to compute or store outside a 
band of 2N + 1 diagonals. 

Jacobi and Gauss-Seidel are classical iterative methods and not particu- 
larly fast, but they form the basis for other faster methods: successive overre- 
laxation, symmetric successive overrelaxation, and multigrid, our fastest algo- 
rithm. So we will study them in some detail in section 6.5. 

Sparse Cholesky refers to the algorithm discussed in section 2.7.4: it is 
an implementation of Cholesky that avoids storing or operating on the zero 
entries of $T_{N \times N}$ or its Cholesky factor. Furthermore, we are assuming the
rows and columns of $T_{N \times N}$ have been “optimally ordered” to minimize work
and storage (using nested dissection [110, 111]). While sparse Cholesky is
reasonably fast on Poisson’s equation in two dimensions, it is significantly
worse in three dimensions (using $O(N^6) = O(n^2)$ time and $O(N^4) = O(n^{4/3})$
space), because there is more “fill-in” of zero entries during the algorithm.

Conjugate gradients, while not particularly fast on our model problem, 
are a representative of a much larger class of methods, called Krylov subspace 
methods, which are very widely applicable both for linear system solving and 
finding eigenvalues of sparse matrices. We will discuss these methods in more 
detail in section 6.6. 

The fastest methods are block cyclic reduction, the fast Fourier transform 
(FFT), and multigrid. In particular, multigrid does only O(1) operations per 
solution component, which is asymptotically optimal. 

A final warning is that this table does not give a complete picture, since 
the constants are missing. For a particular size problem on a particular ma- 
chine, one cannot immediately deduce which method is fastest. Still, it is clear 
that iterative methods such as Jacobi, Gauss-Seidel, conjugate gradients, and 
successive overrelaxation are inferior to the FFT, block cyclic reduction, and 
multigrid for large enough n. But they remain of interest because they are 
building blocks for some of the faster methods, and because they apply to 
larger classes of problems than the faster methods. 

All of these algorithms can be implemented in parallel; see the lectures
on PARALLEL_HOMEPAGE for details. It is interesting that, depending on
the parallel machine, multigrid may no longer be fastest. This is because on 
a parallel machine the time required for separate processors to communicate 
data to one another may be as costly as the floating point operations, and 
other algorithms may communicate less than multigrid. 


6.5. Basic Iterative Methods 


In this section we will talk about the most basic iterative methods:

    Jacobi’s method,
    the Gauss-Seidel method,
    successive overrelaxation (SOR(w)),
    Chebyshev acceleration with symmetric successive overrelaxation (SSOR(w)).

These methods are also discussed and their implementations are provided at
NETLIB/templates.

Given $x_0$, these methods generate a sequence $x_m$ converging to the solution
$A^{-1}b$ of $Ax = b$, where $x_{m+1}$ is cheap to compute from $x_m$.


DEFINITION 6.3. A splitting of A is a decomposition A = M — K, with M 
nonsingular. 


A splitting yields an iterative method as follows: $Ax = Mx - Kx = b$
implies $Mx = Kx + b$ or $x = M^{-1}Kx + M^{-1}b \equiv Rx + c$. So we can take
$x_{m+1} = Rx_m + c$ as our iterative method. Let us see when it converges.


LEMMA 6.4. Let $\|\cdot\|$ be any operator norm ($\|R\| \equiv \max_{x \ne 0} \frac{\|Rx\|}{\|x\|}$). If $\|R\| < 1$,
then $x_{m+1} = Rx_m + c$ converges for any $x_0$.


Proof. Subtract $x = Rx + c$ from $x_{m+1} = Rx_m + c$ to get $x_{m+1} - x = R(x_m - x)$.
Thus $\|x_{m+1} - x\| \le \|R\| \cdot \|x_m - x\| \le \|R\|^{m+1} \cdot \|x_0 - x\|$, which converges to
0 since $\|R\| < 1$.

Our ultimate convergence criterion will depend on the following property 
of R. 


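DEFINITION 6.4. The spectral radius of R is $\rho(R) \equiv \max |\lambda|$, where the max-
imum is taken over all eigenvalues $\lambda$ of R.


LEMMA 6.5. For all operator norms $\rho(R) \le \|R\|$. For all R and for all $\epsilon > 0$
there is an operator norm $\|\cdot\|_*$ such that $\|R\|_* \le \rho(R) + \epsilon$. $\|\cdot\|_*$ depends on
both R and $\epsilon$.

Proof. To show $\rho(R) \le \|R\|$ for any operator norm, let x be an eigenvector for
$\lambda$, where $\rho(R) = |\lambda|$, and so

$$
\|R\| = \max_{y \ne 0} \frac{\|Ry\|}{\|y\|} \ge \frac{\|Rx\|}{\|x\|} = \frac{\|\lambda x\|}{\|x\|} = |\lambda|.
$$

To construct an operator norm $\|\cdot\|_*$ such that $\|R\|_* \le \rho(R) + \epsilon$, let $S^{-1}RS = J$
be in Jordan form. Let $D_\epsilon = \mathrm{diag}(1, \epsilon, \epsilon^2, \ldots, \epsilon^{n-1})$. Then

$$
(SD_\epsilon)^{-1} R (SD_\epsilon) = D_\epsilon^{-1} J D_\epsilon =
\begin{bmatrix}
\lambda_1 & \epsilon & & & & & \\
 & \ddots & \ddots & & & & \\
 & & \lambda_1 & & & & \\
 & & & \lambda_2 & \epsilon & & \\
 & & & & \ddots & \ddots & \\
 & & & & & \lambda_2 & \\
 & & & & & & \ddots
\end{bmatrix},
$$

i.e., a “Jordan form” with $\epsilon$’s above the diagonal. Now use the vector norm
$\|x\|_* \equiv \|(SD_\epsilon)^{-1}x\|_\infty$ to generate the operator norm

$$
\begin{aligned}
\|R\|_* &= \max_{x \ne 0} \frac{\|Rx\|_*}{\|x\|_*} \\
&= \max_{x \ne 0} \frac{\|(SD_\epsilon)^{-1}Rx\|_\infty}{\|(SD_\epsilon)^{-1}x\|_\infty} \\
&= \max_{y \ne 0} \frac{\|(SD_\epsilon)^{-1}R(SD_\epsilon)y\|_\infty}{\|y\|_\infty} \\
&= \|(SD_\epsilon)^{-1}R(SD_\epsilon)\|_\infty \\
&\le \max_i |\lambda_i| + \epsilon \\
&= \rho(R) + \epsilon.
\end{aligned}
$$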


THEOREM 6.1. The iteration $x_{m+1} = Rx_m + c$ converges to the solution of
$Ax = b$ for all starting vectors $x_0$ and for all b if and only if $\rho(R) < 1$.


Proof. If $\rho(R) \ge 1$, choose $x_0 - x$ to be an eigenvector of R with eigenvalue
$\lambda$ where $|\lambda| = \rho(R)$. Then

$$
(x_{m+1} - x) = R(x_m - x) = \cdots = R^{m+1}(x_0 - x) = \lambda^{m+1}(x_0 - x)
$$

will not approach 0. If $\rho(R) < 1$, use Lemma 6.5 to choose an operator norm so
$\|R\|_* < 1$ and then apply Lemma 6.4 to conclude that the method converges.
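
As a small illustration of Theorem 6.1 (a sketch, not code from the text), the following Python fragment applies the iteration $x_{m+1} = Rx_m + c$ to a 2-by-2 system. The function name `splitting_iteration` and the test matrix are hypothetical choices; the splitting used is M = diag(A), for which $\rho(R) = 1/4$, so the iterates should approach the exact solution [1, 1].

```python
def splitting_iteration(R, c, x0, steps):
    """Apply x_{m+1} = R x_m + c `steps` times (dense lists, illustration only)."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        x = [sum(R[i][k] * x[k] for k in range(n)) + c[i] for i in range(n)]
    return x


# Hypothetical example: A = M - K with M = diag(A), so R = M^{-1} K and c = M^{-1} b.
A = [[4.0, -1.0], [-1.0, 4.0]]
b = [3.0, 3.0]
n = len(A)
R = [[0.0 if i == k else -A[i][k] / A[i][i] for k in range(n)] for i in range(n)]
c = [b[i] / A[i][i] for i in range(n)]

x = splitting_iteration(R, c, [0.0, 0.0], 50)
errors = [abs(xi - 1.0) for xi in x]  # distance from the exact solution [1, 1]
```

Since the error is multiplied by a matrix of spectral radius 1/4 at each step, a few dozen steps already reach rounding-error level in this example.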


DEFINITION 6.5. The rate of convergence of $x_{m+1} = Rx_m + c$ is $r(R) \equiv
-\log_{10} \rho(R)$.

r(R) is the increase in the number of correct decimal places in the solution
per iteration, since $\log_{10} \|x_m - x\|_* - \log_{10} \|x_{m+1} - x\|_* \ge r(R) + O(\epsilon)$. The
smaller is $\rho(R)$, the higher is the rate of convergence, i.e., the greater is the
number of correct decimal places computed per iteration.

Our goal is now to choose a splitting A = M — K so that both 


(1) $Rx = M^{-1}Kx$ and $c = M^{-1}b$ are easy to evaluate,
(2) $\rho(R)$ is small.


We will need to balance these conflicting goals. For example, choosing M = I
is good for goal (1) but may not make $\rho(R) < 1$. On the other hand, choosing
M = A and K = 0 is good for goal (2) but probably bad for goal (1).

The splittings for the methods discussed in this section all share the fol- 
lowing notation. When A has no zeros on its diagonal, we write 


$$
A = D - \tilde{L} - \tilde{U} = D(I - L - U), \tag{6.19}
$$


where D is the diagonal of A, $-\tilde{L}$ is the strictly lower triangular part of A,
$DL = \tilde{L}$, $-\tilde{U}$ is the strictly upper triangular part of A, and $DU = \tilde{U}$.


6.5.1. Jacobi’s Method


Jacobi’s method can be described as repeatedly looping through the equations,
changing variable j so that equation j is satisfied exactly. Using the notation of
equation (6.19), the splitting for Jacobi’s method is $A = D - (\tilde{L} + \tilde{U})$; we denote
$R_J \equiv D^{-1}(\tilde{L} + \tilde{U}) = L + U$ and $c_J \equiv D^{-1}b$, so we can write one step of Jacobi’s
method as $x_{m+1} = R_J x_m + c_J$. To see that this formula corresponds to our first
description of Jacobi’s method, note that it implies $Dx_{m+1} = (\tilde{L} + \tilde{U})x_m + b$,
$a_{jj}x_{m+1,j} = -\sum_{k \ne j} a_{jk}x_{m,k} + b_j$, or $a_{jj}x_{m+1,j} + \sum_{k \ne j} a_{jk}x_{m,k} = b_j$.


ALGORITHM 6.1. One step of Jacobi’s method:


    for j = 1 to n
        $x_{m+1,j} = \frac{1}{a_{jj}} \left( b_j - \sum_{k \ne j} a_{jk} x_{m,k} \right)$
    end for


In the special case of the model problem, the implementation of Jacobi’s
algorithm simplifies as follows. Working directly from equation (6.10) and
letting $v_{m,i,j}$ denote the mth value of the solution at grid point i,j, Jacobi’s
method becomes the following.


ALGORITHM 6.2. One step of Jacobi’s method for two-dimensional Poisson’s
equation:


    for i = 1 to N
        for j = 1 to N
            $v_{m+1,i,j} = (v_{m,i-1,j} + v_{m,i+1,j} + v_{m,i,j-1} + v_{m,i,j+1} + h^2 f_{ij})/4$
        end for
    end for


In other words, at each step the new value of $v_{i,j}$ is obtained by “averaging”
its neighbors with $h^2 f_{ij}$. Note that all new values $v_{m+1,i,j}$ may be computed
independently of one another. Indeed, Algorithm 6.2 can be implemented in
one line of Matlab if the $v_{m+1,i,j}$ are stored in a square array V that includes
an extra first and last row of zeros and first and last column of zeros (see
Question 6.6).
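
A minimal Python sketch of Algorithm 6.2 follows, using the padded-array idea just described. It is an illustration under stated assumptions rather than the text's Matlab one-liner: the names `jacobi_step` and `residual`, the grid size N = 3, and the right-hand side f = 1 are all hypothetical choices.

```python
def jacobi_step(V, F, h):
    """One Jacobi sweep for the 2D model problem.

    V and F are (N+2)-by-(N+2) lists of lists; the border of V holds the zero
    boundary values and the border of F is ignored.
    """
    N = len(V) - 2
    W = [row[:] for row in V]  # every new value is computed from the old array V only
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            W[i][j] = (V[i - 1][j] + V[i + 1][j] + V[i][j - 1] + V[i][j + 1]
                       + h * h * F[i][j]) / 4.0
    return W


def residual(V, F, h):
    """Largest |h^2 f_ij - (4 v_ij - sum of the four neighbors)| over interior points."""
    N = len(V) - 2
    return max(abs(h * h * F[i][j] - (4.0 * V[i][j] - V[i - 1][j] - V[i + 1][j]
                                      - V[i][j - 1] - V[i][j + 1]))
               for i in range(1, N + 1) for j in range(1, N + 1))


N = 3
h = 1.0 / (N + 1)
F = [[1.0] * (N + 2) for _ in range(N + 2)]  # hypothetical right-hand side f = 1
V = [[0.0] * (N + 2) for _ in range(N + 2)]  # starting guess v = 0
for _ in range(200):
    V = jacobi_step(V, F, h)
```

Because each sweep reads only the old array, the two inner loops could be executed in any order or in parallel, which is the independence property noted above.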


6.5.2. Gauss-Seidel Method 


The motivation for this method is that at the jth step of the loop for Jacobi’s
method, we have improved values of the first j − 1 components of the solution,
so we should use them in the sum.


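ALGORITHM 6.3. One step of the Gauss-Seidel method:


    for j = 1 to n
        $x_{m+1,j} = \frac{1}{a_{jj}} \left( b_j - \underbrace{\sum_{k=1}^{j-1} a_{jk} x_{m+1,k}}_{\text{updated } x\text{’s}} - \underbrace{\sum_{k=j+1}^{n} a_{jk} x_{m,k}}_{\text{older } x\text{’s}} \right)$
    end for


For the purpose of later analysis, we want to write this algorithm in the form
$x_{m+1} = R_{GS} x_m + c_{GS}$. To this end, note that it can first be rewritten as

$$
\sum_{k=1}^{j} a_{jk} x_{m+1,k} = -\sum_{k=j+1}^{n} a_{jk} x_{m,k} + b_j. \tag{6.20}
$$

Then using the notation of equation (6.19), we can rewrite equation (6.20) as
$(D - \tilde{L}) x_{m+1} = \tilde{U} x_m + b$ or

$$
\begin{aligned}
x_{m+1} &= (D - \tilde{L})^{-1} \tilde{U} x_m + (D - \tilde{L})^{-1} b \\
&= (I - L)^{-1} U x_m + (I - L)^{-1} D^{-1} b \\
&\equiv R_{GS} x_m + c_{GS}.
\end{aligned}
$$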


As with Jacobi’s method, we consider how to implement the Gauss-Seidel
method for our model problem. In principle it is quite similar, except that we
have to keep track of which variables are new (numbered m + 1) and which
are old (numbered m). But depending on the order in which we loop through
the grid points i,j, we will get different (and valid) implementations of the
Gauss-Seidel method. This is unlike Jacobi’s method, in which the order in
which we update the variables is irrelevant. For example, if we update $v_{m,1,1}$
first (before any other $v_{m,i,j}$), then all its neighboring values are necessarily
old. But if we update $v_{m,1,1}$ last, then all its neighboring values are necessarily
new, so we get a different value for $v_{m,1,1}$. Indeed, there are as many possible
implementations of the Gauss-Seidel method as there are ways to order $N^2$
variables (namely, $(N^2)!$). But of all these orderings, only two are of interest.
The first is the ordering shown in Figure 6.4; this is called the natural ordering.

The second ordering is called red-black ordering. It is important because
our best convergence results in sections 6.5.4 and 6.5.5 depend on it. To
explain red-black ordering, consider the chessboard-like coloring of the grid of
unknowns below; the Ⓑ nodes correspond to the black squares on a
chessboard, and the Ⓡ nodes correspond to the red squares.

    [figure: chessboard-like coloring of the grid of unknowns]

The red-black ordering is to order the red nodes before the black nodes.
Note that red nodes are adjacent to only black nodes. So if we update all the
red nodes first, they will use only old data from the black nodes. Then when
we update the black nodes, which are only adjacent to red nodes, they will use
only new data from the red nodes. Thus the algorithm becomes the following.


ALGORITHM 6.4. One step of the Gauss-Seidel method on two-dimensional
Poisson’s equation with red-black ordering:


    for all nodes i,j that are red (Ⓡ)
        $v_{m+1,i,j} = (v_{m,i-1,j} + v_{m,i+1,j} + v_{m,i,j-1} + v_{m,i,j+1} + h^2 f_{ij})/4$
    end for
    for all nodes i,j that are black (Ⓑ)
        $v_{m+1,i,j} = (v_{m+1,i-1,j} + v_{m+1,i+1,j} + v_{m+1,i,j-1} + v_{m+1,i,j+1} + h^2 f_{ij})/4$
    end for
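
The red-black sweep of Algorithm 6.4 can be sketched in Python as follows. This is an illustrative sketch only: the function names, the convention that red nodes are those with i + j even, the grid size N = 7, and the right-hand side f = 1 are assumptions made for the example.

```python
def gauss_seidel_red_black_step(V, F, h):
    """One Gauss-Seidel sweep in red-black order, updating V in place.

    V and F are (N+2)-by-(N+2) lists of lists with a zero border in V.
    """
    N = len(V) - 2
    for parity in (0, 1):  # assumption: red nodes have i + j even, black nodes i + j odd
        for i in range(1, N + 1):
            for j in range(1, N + 1):
                if (i + j) % 2 == parity:
                    # Red updates see only old black values; black updates see new red values.
                    V[i][j] = (V[i - 1][j] + V[i + 1][j] + V[i][j - 1] + V[i][j + 1]
                               + h * h * F[i][j]) / 4.0
    return V


def residual(V, F, h):
    """Largest |h^2 f_ij - (4 v_ij - sum of the four neighbors)| over interior points."""
    N = len(V) - 2
    return max(abs(h * h * F[i][j] - (4.0 * V[i][j] - V[i - 1][j] - V[i + 1][j]
                                      - V[i][j - 1] - V[i][j + 1]))
               for i in range(1, N + 1) for j in range(1, N + 1))


N = 7
h = 1.0 / (N + 1)
F = [[1.0] * (N + 2) for _ in range(N + 2)]  # hypothetical right-hand side f = 1
V = [[0.0] * (N + 2) for _ in range(N + 2)]  # starting guess v = 0
for _ in range(300):
    gauss_seidel_red_black_step(V, F, h)
```

Updating in place is what distinguishes this sweep from the Jacobi sketch: no second array is needed, because within one color no node depends on another node of the same color.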


6.5.3. Successive Overrelaxation 


We refer to this method as SOR(w), where w is the relaxation parameter.
The motivation is to improve the Gauss-Seidel loop by taking an appropriate
weighted average of the $x_{m+1,j}$ and $x_{m,j}$:

    SOR’s $x_{m+1,j} = (1 - w) x_{m,j} + w x_{m+1,j}$,

where the $x_{m+1,j}$ on the right-hand side is the Gauss-Seidel value,
yielding the following algorithm.

ALGORITHM 6.5. SOR:

    for j = 1 to n
        $x_{m+1,j} = (1 - w) x_{m,j} + \frac{w}{a_{jj}} \left[ b_j - \sum_{k=1}^{j-1} a_{jk} x_{m+1,k} - \sum_{k=j+1}^{n} a_{jk} x_{m,k} \right]$
    end for


We may rearrange this to get, for j = 1 to n,

$$
a_{jj} x_{m+1,j} + w \sum_{k=1}^{j-1} a_{jk} x_{m+1,k} = (1 - w) a_{jj} x_{m,j} - w \sum_{k=j+1}^{n} a_{jk} x_{m,k} + w b_j
$$

or, again using the notation of equation (6.19),

$$
(D - w\tilde{L}) x_{m+1} = ((1 - w) D + w\tilde{U}) x_m + w b
$$

or

$$
\begin{aligned}
x_{m+1} &= (D - w\tilde{L})^{-1} ((1 - w) D + w\tilde{U}) x_m + w (D - w\tilde{L})^{-1} b \\
&= (I - wL)^{-1} ((1 - w) I + wU) x_m + w (I - wL)^{-1} D^{-1} b \\
&\equiv R_{SOR(w)} x_m + c_{SOR(w)}.
\end{aligned}
\tag{6.21}
$$


We distinguish three cases, depending on the values of w: w = 1 is equiv-
alent to the Gauss-Seidel method, w < 1 is called underrelaxation, and w > 1
is called overrelaxation. A somewhat superficial motivation for overrelaxation
is that if the direction from $x_m$ to $x_{m+1}$ is a good direction in which to move
the solution, then moving w > 1 times as far in that direction is better.

In the next two sections, we will show how to pick the optimal w for the 
model problem. This optimality depends on using red-black ordering. 


ALGORITHM 6.6. One step of SOR(w) on two-dimensional Poisson’s equation
with red-black ordering:


    for all nodes i,j that are red (Ⓡ)
        $v_{m+1,i,j} = (1 - w) v_{m,i,j} + w (v_{m,i-1,j} + v_{m,i+1,j} + v_{m,i,j-1} + v_{m,i,j+1} + h^2 f_{ij})/4$
    end for
    for all nodes i,j that are black (Ⓑ)
        $v_{m+1,i,j} = (1 - w) v_{m,i,j} + w (v_{m+1,i-1,j} + v_{m+1,i+1,j} + v_{m+1,i,j-1} + v_{m+1,i,j+1} + h^2 f_{ij})/4$
    end for
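
A hedged Python sketch of Algorithm 6.6 is given below, together with a small experiment that counts sweeps for w = 1 (Gauss-Seidel) and for $w = 2/(1 + \sin\frac{\pi}{N+1})$, the value quoted for the model problem in section 6.5.4. The function names, the red = "i + j even" convention, the tolerance, and the choice f = 1 are assumptions made for illustration only.

```python
import math


def sor_red_black_step(V, F, h, w):
    """One SOR(w) sweep in red-black order, updating V in place."""
    N = len(V) - 2
    for parity in (0, 1):  # assumption: red nodes have i + j even, black nodes i + j odd
        for i in range(1, N + 1):
            for j in range(1, N + 1):
                if (i + j) % 2 == parity:
                    gs_value = (V[i - 1][j] + V[i + 1][j] + V[i][j - 1] + V[i][j + 1]
                                + h * h * F[i][j]) / 4.0
                    # Weighted average of the old value and the Gauss-Seidel value.
                    V[i][j] = (1.0 - w) * V[i][j] + w * gs_value
    return V


def residual(V, F, h):
    """Largest |h^2 f_ij - (4 v_ij - sum of the four neighbors)| over interior points."""
    N = len(V) - 2
    return max(abs(h * h * F[i][j] - (4.0 * V[i][j] - V[i - 1][j] - V[i + 1][j]
                                      - V[i][j - 1] - V[i][j + 1]))
               for i in range(1, N + 1) for j in range(1, N + 1))


def sweeps_to_converge(N, w, tol=1e-10, max_sweeps=100000):
    """Count sweeps until the residual drops below tol (f = 1, zero starting guess)."""
    h = 1.0 / (N + 1)
    F = [[1.0] * (N + 2) for _ in range(N + 2)]
    V = [[0.0] * (N + 2) for _ in range(N + 2)]
    for sweep in range(1, max_sweeps + 1):
        sor_red_black_step(V, F, h, w)
        if residual(V, F, h) < tol:
            return sweep
    return max_sweeps


N = 15
w_opt = 2.0 / (1.0 + math.sin(math.pi / (N + 1)))
gs_sweeps = sweeps_to_converge(N, 1.0)     # w = 1 is the Gauss-Seidel method
sor_sweeps = sweeps_to_converge(N, w_opt)  # overrelaxation with the quoted w
```

On this small grid the overrelaxed iteration should need noticeably fewer sweeps than Gauss-Seidel; the exact counts depend on the tolerance and the right-hand side chosen here.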
6.5.4. Convergence of Jacobi’s, Gauss-Seidel, and
SOR(w) Methods on the Model Problem


It is easy to compute how fast Jacobi’s method converges on the model problem,
since the corresponding splitting is $T_{N \times N} = 4I - (4I - T_{N \times N})$, and so $R_J =
(4I)^{-1}(4I - T_{N \times N}) = I - T_{N \times N}/4$. Thus the eigenvalues of $R_J$ are $1 - \lambda_{i,j}/4$,
where the $\lambda_{i,j}$ are the eigenvalues of $T_{N \times N}$:

$$
\lambda_{i,j} = \lambda_i + \lambda_j = 4 - 2 \left( \cos \frac{\pi i}{N+1} + \cos \frac{\pi j}{N+1} \right).
$$

$\rho(R_J)$ is the largest of $|1 - \lambda_{i,j}/4|$, namely,

$$
\rho(R_J) = |1 - \lambda_{1,1}/4| = |1 - \lambda_{N,N}/4| = \cos \frac{\pi}{N+1} \approx 1 - \frac{\pi^2}{2(N+1)^2}.
$$


Note that as N grows and T becomes more ill-conditioned, the spectral
radius $\rho(R_J)$ approaches 1. Since the error is multiplied by the spectral radius
at each step, convergence slows down. To estimate the speed of convergence
more precisely, let us compute the number m of Jacobi iterations required to
decrease the error by $e^{-1} = \exp(-1)$. Then m must satisfy $(\rho(R_J))^m = e^{-1}$,
$(1 - \frac{\pi^2}{2(N+1)^2})^m = e^{-1}$, or $m \approx \frac{2(N+1)^2}{\pi^2} = O(N^2) = O(n)$. Thus the number of
iterations is proportional to the number of unknowns. Since one step of Jacobi
costs O(1) to update each solution component or O(n) to update all of them,
it costs $O(n^2)$ to decrease the error by $e^{-1}$ (or by any constant factor less than
1). This explains the entry for Jacobi’s method in Table 6.1.
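
The estimate $m \approx 2(N+1)^2/\pi^2$ can be checked numerically. The short Python sketch below compares it with the value of m obtained directly from $\rho(R_J)^m = e^{-1}$; the function names and the grid size N = 63 are hypothetical choices for illustration.

```python
import math


def jacobi_spectral_radius(N):
    """rho(R_J) = cos(pi / (N + 1)) for the model problem on an N-by-N grid."""
    return math.cos(math.pi / (N + 1))


def iterations_to_reduce_by_e(rho):
    """The real number m with rho**m = exp(-1), i.e., m = -1 / ln(rho)."""
    return -1.0 / math.log(rho)


N = 63
rho = jacobi_spectral_radius(N)
m_exact = iterations_to_reduce_by_e(rho)
m_estimate = 2.0 * (N + 1) ** 2 / math.pi ** 2  # the O(N^2) estimate derived above
```

The two values agree to well within one percent for this N, and the agreement improves as N grows, since the estimate only drops higher-order terms in $\pi/(N+1)$.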

This is a common phenomenon: the more ill-conditioned the original prob- 
lem, the more slowly most iterative methods converge. There are important 
exceptions, such as multigrid and domain decomposition, which we discuss 
later. 

In the next section we will show, provided that the variables in Poisson’s
equation are updated in red-black order (see Algorithm 6.4 and Corollary 6.1),
that $\rho(R_{GS}) = \rho(R_J)^2 = \cos^2 \frac{\pi}{N+1}$. In other words, one Gauss-Seidel
step decreases the error as much as two Jacobi steps. This is a general phe-
nomenon for matrices arising from approximating differential equations with
certain finite difference approximations. This also explains the entry for the
Gauss-Seidel method in Table 6.1; since it is only twice as fast as Jacobi, it
still has the same complexity in the $O(\cdot)$ sense.

For the same red-black update order (see Algorithm 6.6 and Theorem 6.7),
we will also show that for the relaxation parameter $1 < w = 2/(1 + \sin\frac{\pi}{N+1}) < 2$

$$
\rho(R_{SOR(w)}) = \frac{\cos^2 \frac{\pi}{N+1}}{\left( 1 + \sin \frac{\pi}{N+1} \right)^2} \approx 1 - \frac{2\pi}{N+1} \quad \text{for large } N.
$$

This is in contrast to $\rho(R) = 1 - O(\frac{1}{N^2})$ for $R_J$ and $R_{GS}$. The value of w above
is the optimal one; i.e., it minimizes $\rho(R_{SOR(w)})$. With this choice of w, SOR(w) is
approximately N times faster than Jacobi’s or the Gauss-Seidel method, since
if SOR(w) takes j steps to decrease the error as much as k steps of Jacobi’s or
the Gauss-Seidel method, then $(1 - \frac{1}{N^2})^k \approx (1 - \frac{1}{N})^j$, implying $1 - \frac{k}{N^2} \approx 1 - \frac{j}{N}$
or $k \approx j \cdot N$. This lowers the complexity of SOR(w) from $O(n^2)$ to $O(n^{3/2})$, as
shown in Table 6.1.

In the next section we will show generally for certain finite difference ma-
trices how to choose w to minimize $\rho(R_{SOR(w)})$.


6.5.5. Detailed Convergence Criteria for Jacobi’s, 
Gauss-Seidel, and SOR(w) Methods 


We will give a sequence of conditions that guarantee the convergence of these 
methods. The first criterion is simple to apply but is not always applicable, in 
particular not to the model problem. Then we give several more complicated 
criteria, which place stronger conditions on the matrix A but in return give 
more information about convergence. These more complicated criteria are 
tailored to fit the matrices arising from discretizing certain kinds of partial 
differential equations such as Poisson’s equation. 
Here is a summary of the results of this section: 


1. If A is strictly row diagonally dominant (Definition 6.6), then Jacobi’s 
and the Gauss-Seidel methods both converge, and the Gauss-Seidel 
method is faster (Theorem 6.2). Strict row diagonal dominance means 
that each diagonal entry of A is larger in magnitude than the sum of the 
magnitudes of the other entries in its row. 


2. Since our model problem is not strictly row diagonally dominant, the 
last result does not apply. So we ask for a weaker form of diagonal dom- 
inance (Definition 6.11) but impose a condition called irreducibility on 
the pattern of nonzero entries of A (Definition 6.7) to prove convergence 
of Jacobi’s and the Gauss-Seidel methods. The Gauss-Seidel method 
again converges faster than Jacobi’s method (Theorem 6.3). This result 
applies to the model problem. 


3. Turning to SOR(w), we show that 0 < w < 2 is necessary for convergence 
(Theorem 6.4). If A is also positive definite (like the model problem), 
0 <w < 2 is also sufficient for convergence (Theorem 6.5). 


4. To quantitatively compare Jacobi’s, Gauss-Seidel, and SOR(w) methods, 
we make one more assumption about the pattern of nonzero entries of A. 
This property is called Property A (Definition 6.12) and is equivalent to 
saying that the graph of the matrix is bipartite. Property A essentially 
says that we can update the variables using red-black ordering. Given 
Property A there is a simple algebraic formula relating the eigenvalues 
of $R_J$, $R_{GS}$, and $R_{SOR(w)}$ (Theorem 6.6), which lets us compare their
rates of convergence. This formula also lets us compute the optimal w
that makes SOR(w) converge as fast as possible (Theorem 6.7).


DEFINITION 6.6. A is strictly row diagonally dominant if $|a_{ii}| > \sum_{j \ne i} |a_{ij}|$
for all i.
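
Definition 6.6 translates directly into a test on the rows of A. The Python sketch below is illustrative only; the function name and the two small test matrices are hypothetical. The second matrix is a one-dimensional analogue of the model problem, whose middle row has equality rather than strict inequality.

```python
def is_strictly_row_diagonally_dominant(A):
    """Check |a_ii| > sum over j != i of |a_ij| for every row i of a dense matrix A."""
    n = len(A)
    return all(abs(A[i][i]) > sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))


A_strict = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
A_not_strict = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]

strict_result = is_strictly_row_diagonally_dominant(A_strict)
not_strict_result = is_strictly_row_diagonally_dominant(A_not_strict)
```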


THEOREM 6.2. If A is strictly row diagonally dominant, Jacobi’s and the
Gauss-Seidel methods both converge. In fact $\|R_{GS}\|_\infty \le \|R_J\|_\infty < 1$.


The inequality $\|R_{GS}\|_\infty \le \|R_J\|_\infty$ implies that one step of the worst prob-
lem for the Gauss-Seidel method converges at least as fast as one step of
the worst problem for Jacobi’s method. It does not guarantee that for any
particular $Ax = b$, the Gauss-Seidel method will be faster than Jacobi’s
method; Jacobi’s method could “accidentally” have a smaller error at some
step.

Proof. Again using the notation of equation (6.19), we write $R_J = L + U$
and $R_{GS} = (I - L)^{-1}U$. We want to prove

$$
\|R_{GS}\|_\infty = \| |R_{GS}| e \|_\infty \le \| |R_J| e \|_\infty = \|R_J\|_\infty, \tag{6.22}
$$

where $e = [1, \ldots, 1]^T$ is the vector of all ones. Inequality (6.22) will be true if we
can prove the stronger componentwise inequality

$$
|(I - L)^{-1}U| \cdot e = |R_{GS}| \cdot e \le |R_J| \cdot e = (|L| + |U|) \cdot e. \tag{6.23}
$$

Since

$$
\begin{aligned}
|(I - L)^{-1}U| \cdot e &\le |(I - L)^{-1}| \cdot |U| \cdot e && \text{by the triangle inequality} \\
&= \Big| \sum_{i=0}^{n-1} L^i \Big| \cdot |U| \cdot e && \text{since } L^n = 0 \\
&\le \sum_{i=0}^{n-1} |L|^i \cdot |U| \cdot e && \text{by the triangle inequality} \\
&= (I - |L|)^{-1} \cdot |U| \cdot e && \text{since } |L|^n = 0,
\end{aligned}
$$

inequality (6.23) will be true if we can prove the even stronger componentwise
inequality

$$
(I - |L|)^{-1} \cdot |U| \cdot e \le (|L| + |U|) \cdot e. \tag{6.24}
$$

Since all entries of $(I - |L|)^{-1} = \sum_{i=0}^{n-1} |L|^i$ are nonnegative, inequality (6.24)
will be true if we can prove

$$
|U| \cdot e \le (I - |L|) \cdot (|L| + |U|) \cdot e = (|L| + |U| - |L|^2 - |L| \cdot |U|) \cdot e
$$

or

$$
0 \le (|L| - |L|^2 - |L| \cdot |U|) \cdot e = |L| \cdot (I - |L| - |U|) \cdot e. \tag{6.25}
$$

Since all entries of $|L|$ are nonnegative, inequality (6.25) will be true if we can
prove

$$
0 \le (I - |L| - |U|) \cdot e \quad \text{or} \quad |R_J| \cdot e = (|L| + |U|) \cdot e \le e. \tag{6.26}
$$

Finally, inequality (6.26) is true because by assumption $\| |R_J| \cdot e \|_\infty = \|R_J\|_\infty < 1$.

An analogous result holds when A is strictly column diagonally dominant
(i.e., $A^T$ is strictly row diagonally dominant).

The reader may easily confirm that this simple criterion does not apply to
the model problem, so we need to weaken the assumption of strict diagonal
dominance. Doing so requires looking at the graph properties of a matrix.
The reader may easily confirm that this simple criterion does not apply to 
the model problem, so we need to weaken the assumption of strict diagonal 
dominance. Doing so requires looking at the graph properties of a matrix. 


DEFINITION 6.7. A is an irreducible matrix if there is no permutation matrix
P such that

$$
PAP^T = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix}.
$$

We connect this definition to graph theory as follows. 


DEFINITION 6.8. A directed graph is a finite collection of nodes connected by
a finite collection of directed edges, i.e., arrows from one node to another. A
path in a directed graph is a sequence of nodes $n_1, \ldots, n_m$ with an edge from
each $n_i$ to $n_{i+1}$. A self edge is an edge from a node to itself.


DEFINITION 6.9. The directed graph of A, G(A), is a graph with nodes 1,2,...,n
and an edge from node i to node j if and only if $a_{ij} \ne 0$.


EXAMPLE 6.1. The matrix

    [matrix A]

has the directed graph

    [figure: directed graph G(A)]


DEFINITION 6.10. A directed graph is called strongly connected if there exists 
a path from every node i to every node j. A strongly connected component of 
a directed graph is a subgraph (a subset of the nodes with all edges connecting 
them) which is strongly connected and cannot be made larger yet still be strongly 
connected. 


EXAMPLE 6.2. The graph in Example 6.1 is strongly connected. ◇


EXAMPLE 6.3. Let

    [matrix A]

which has the directed graph

    [figure: directed graph G(A)]

This graph is not strongly connected, since there is no path to node 1 from
anywhere else. Nodes 4, 5, and 6 form a strongly connected component, since
there is a path from any one of them to any other. ◇


EXAMPLE 6.4. The graph of the model problem is strongly connected. The
graph is essentially

    [figure: N-by-N grid graph]

except that each edge in the grid represents two edges (one in each direction),
and the self edges are not shown. ◇


LEMMA 6.6. A is irreducible if and only if G(A) is strongly connected. 


Proof. If $A = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix}$ is reducible, then there is clearly no way to get from
the nodes corresponding to $A_{22}$ back to the ones corresponding to $A_{11}$; i.e.,
G(A) is not strongly connected. Similarly, if G(A) is not strongly connected,
renumber the rows (and columns) so that all the nodes in a particular strongly
connected component come first; then the matrix $PAP^T$ will be block upper
triangular.
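
Lemma 6.6 suggests a simple computational test for irreducibility: build G(A) and check strong connectivity. The Python sketch below is one illustrative way to do this, not a method from the text. It uses the standard fact that a directed graph is strongly connected exactly when every node can be reached from node 0 and node 0 can be reached from every node; the function names and test matrices are hypothetical.

```python
def reachable(adjacency, start):
    """Set of nodes reachable from `start` by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j in adjacency[i]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def is_irreducible(A):
    """True if G(A) is strongly connected (A a nonempty dense square matrix)."""
    n = len(A)
    # Edge i -> j exactly when a_ij != 0 (Definition 6.9).
    forward = [[j for j in range(n) if A[i][j] != 0] for i in range(n)]
    # Reversed graph: following these edges finds the nodes that can reach node 0.
    backward = [[j for j in range(n) if A[j][i] != 0] for i in range(n)]
    return len(reachable(forward, 0)) == n and len(reachable(backward, 0)) == n


tridiagonal = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
block_upper_triangular = [[1, 1], [0, 1]]

tridiagonal_result = is_irreducible(tridiagonal)
triangular_result = is_irreducible(block_upper_triangular)
```

The second matrix is already in the block upper triangular form of Definition 6.7, so it is reducible: there is no edge leading from node 2 back to node 1.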


EXAMPLE 6.5. The matrix A in Example 6.3 is reducible. 

DEFINITION 6.11. A is weakly row diagonally dominant if for all i, $|a_{ii}| \ge
\sum_{k \ne i} |a_{ik}|$ with strict inequality at least once.

THEOREM 6.3. If A is irreducible and weakly row diagonally dominant, then
both Jacobi’s and Gauss-Seidel methods converge, and $\rho(R_{GS}) < \rho(R_J) < 1$.


For a proof of this theorem, see [247].

EXAMPLE 6.6. The model problem is weakly diagonally dominant and irre-
ducible but not strictly diagonally dominant. (The diagonal is 4, and the
offdiagonal sums are either 2, 3, or 4.) So Jacobi’s and Gauss-Seidel methods
converge on the model problem. ◇


Despite the above results showing that under certain conditions the Gauss- 
Seidel method is faster than Jacobi’s method, no such general result holds. 
This is because there are nonsymmetric matrices for which Jacobi’s method 
converges and the Gauss-Seidel method diverges, as well as matrices for which 
the Gauss-Seidel method converges and Jacobi’s method diverges [247]. 

Now we consider the convergence of SOR(w) [247]. Recall its definition:

$$
R_{SOR(w)} = (I - wL)^{-1} ((1 - w) I + wU).
$$


THEOREM 6.4. $\rho(R_{SOR(w)}) \ge |w - 1|$. Therefore 0 < w < 2 is required for
convergence.


Proof. Write the characteristic polynomial of $R_{SOR(w)}$ as $\varphi(\lambda) = \det(\lambda I -
R_{SOR(w)}) = \det((I - wL)(\lambda I - R_{SOR(w)})) = \det((\lambda + w - 1)I - w\lambda L - wU)$ so
that

$$
\varphi(0) = \pm \prod_{i=1}^{n} \lambda_i(R_{SOR(w)}) = \pm \det((w - 1)I) = \pm (w - 1)^n,
$$

implying $\max_i |\lambda_i(R_{SOR(w)})| \ge |w - 1|$.


THEOREM 6.5. If A is symmetric positive definite, then $\rho(R_{SOR(w)}) < 1$ for
all 0 < w < 2, so SOR(w) converges for all 0 < w < 2. Taking w = 1, we see
that the Gauss-Seidel method also converges.


Proof. There are two steps. We abbreviate $R_{SOR(w)} = R$. Using the notation
of equation (6.19), let $M = w^{-1}(D - w\tilde{L})$. Then we


(1) define $Q = A^{-1}(2M - A)$ and show $\Re \lambda_i(Q) > 0$ for all i,
(2) show that $R = (Q - I)(Q + I)^{-1}$, implying $|\lambda_i(R)| < 1$ for all i.


For (1), note that $Qx = \lambda x$ implies $(2M - A)x = \lambda Ax$ or $x^*(2M - A)x =
\lambda x^* Ax$. Add this last equation to its conjugate transpose to get $x^*(M + M^* -
A)x = (\Re \lambda)(x^* Ax)$. So $\Re \lambda = x^*(M + M^* - A)x / x^* Ax = x^*(\frac{2}{w} - 1)Dx / x^* Ax >
0$ since A and $(\frac{2}{w} - 1)D$ are positive definite.

To prove (2), note that $(Q - I)(Q + I)^{-1} = (2A^{-1}M - 2I)(2A^{-1}M)^{-1} =
I - M^{-1}A = R$, so by the spectral mapping theorem (Question 4.5)

$$
|\lambda(R)| = \left| \frac{\lambda(Q) - 1}{\lambda(Q) + 1} \right|
= \left[ \frac{(\Re \lambda(Q) - 1)^2 + (\Im \lambda(Q))^2}{(\Re \lambda(Q) + 1)^2 + (\Im \lambda(Q))^2} \right]^{1/2} < 1.
$$

Together, Theorems 6.4 and 6.5 imply that if A is symmetric positive def-
inite, then SOR(w) converges if and only if 0 < w < 2.


EXAMPLE 6.7. The model problem is symmetric positive definite, so SOR(w)
converges for 0 < w < 2. ◇


For the final comparison of the costs of Jacobi’s, Gauss-Seidel, and SOR(w) 
methods on the model problem we impose another graph theoretic condition 
on A that often arises from certain discretized partial differential equations, 
such as Poisson’s equation. This condition will let us compute $\rho(R_{GS})$ and
$\rho(R_{SOR(w)})$ explicitly in terms of $\rho(R_J)$.


DEFINITION 6.12. A matrix T has Property A if there exists a permutation P
such that

$$
PTP^T = \begin{bmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{bmatrix},
$$

where $T_{11}$ and $T_{22}$ are diagonal. In other words in the graph G(T) the nodes
divide into two sets $S_1 \cup S_2$, where there are no edges between two nodes both
in $S_1$ or both in $S_2$ (ignoring self edges); such a graph is called bipartite.
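
Whether a given matrix has Property A can be tested by trying to two-color its graph. The Python sketch below is an illustration under stated assumptions, not an algorithm from the text: it treats i and j as adjacent when either $t_{ij}$ or $t_{ji}$ is nonzero, ignores self edges, and uses breadth-first search. The function name `red_black_coloring` and the test matrices are hypothetical.

```python
from collections import deque


def red_black_coloring(T):
    """Return a list of colors (0 or 1) if the graph of T is bipartite, else None."""
    n = len(T)
    color = [None] * n
    for start in range(n):          # handle every connected piece of the graph
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if j != i and (T[i][j] != 0 or T[j][i] != 0):
                    if color[j] is None:
                        color[j] = 1 - color[i]   # neighbors get the opposite color
                        queue.append(j)
                    elif color[j] == color[i]:
                        return None               # edge inside one color class
    return color


tridiagonal = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
triangle = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

tridiagonal_colors = red_black_coloring(tridiagonal)
triangle_colors = red_black_coloring(triangle)
```

When a coloring is returned, listing the nodes of one color before those of the other gives a permutation P of the kind required by Definition 6.12.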


EXAMPLE 6.8. Red-black ordering for the model problem. This was introduced
in section 6.5.2, using the following chessboard-like depiction of the graph of
the model problem: The black Ⓑ nodes are in $S_1$, and the red Ⓡ nodes are
in $S_2$.

    [figure: chessboard-like coloring of the grid of unknowns]

As described in section 6.5.2, each equation in the model problem relates
the value at a grid point to the values at its left, right, top, and bottom
neighbors, which are colored differently from the grid point in the middle. In
other words, there is no direct connection from an Ⓡ node to an Ⓡ node
or from a Ⓑ node to a Ⓑ node. So if we number the red nodes before the
black nodes, the matrix will be in the form demanded by Definition 6.12. For
example, in the case of a 3-by-3 grid, we get the following:

$$
P \begin{bmatrix}
4 & -1 & & -1 & & & & & \\
-1 & 4 & -1 & & -1 & & & & \\
 & -1 & 4 & & & -1 & & & \\
-1 & & & 4 & -1 & & -1 & & \\
 & -1 & & -1 & 4 & -1 & & -1 & \\
 & & -1 & & -1 & 4 & & & -1 \\
 & & & -1 & & & 4 & -1 & \\
 & & & & -1 & & -1 & 4 & -1 \\
 & & & & & -1 & & -1 & 4
\end{bmatrix} P^T
=
\left[ \begin{array}{ccccc|cccc}
4 & & & & & -1 & -1 & & \\
 & 4 & & & & -1 & & -1 & \\
 & & 4 & & & -1 & -1 & -1 & -1 \\
 & & & 4 & & & -1 & & -1 \\
 & & & & 4 & & & -1 & -1 \\ \hline
-1 & -1 & -1 & & & 4 & & & \\
-1 & & -1 & -1 & & & 4 & & \\
 & -1 & -1 & & -1 & & & 4 & \\
 & & -1 & -1 & -1 & & & & 4
\end{array} \right]. \quad \diamond
$$

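Now suppose that T has Property A, so we can write (where $D_i = T_{ii}$ is
diagonal)

$$
\begin{aligned}
PTP^T = \begin{bmatrix} D_1 & T_{12} \\ T_{21} & D_2 \end{bmatrix}
&= \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}
- \begin{bmatrix} 0 & 0 \\ -T_{21} & 0 \end{bmatrix}
- \begin{bmatrix} 0 & -T_{12} \\ 0 & 0 \end{bmatrix} \\
&= D - \tilde{L} - \tilde{U}.
\end{aligned}
$$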


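DEFINITION 6.13. Let $R_J(\alpha) = \alpha L + \frac{1}{\alpha} U$. Then $R_J(1) = R_J$ is the iteration
matrix for Jacobi’s method.


PROPOSITION 6.2. The eigenvalues of $R_J(\alpha)$ are independent of $\alpha$.


Proof.

$$
R_J(\alpha) = - \begin{bmatrix} 0 & \frac{1}{\alpha} D_1^{-1} T_{12} \\ \alpha D_2^{-1} T_{21} & 0 \end{bmatrix}
$$

has the same eigenvalues as the similar matrix

$$
\begin{bmatrix} I & \\ & \alpha I \end{bmatrix}^{-1} R_J(\alpha) \begin{bmatrix} I & \\ & \alpha I \end{bmatrix}
= - \begin{bmatrix} 0 & D_1^{-1} T_{12} \\ D_2^{-1} T_{21} & 0 \end{bmatrix} = R_J(1).
$$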


DEFINITION 6.14. Let T be any matrix, with $T = D - \tilde{L} - \tilde{U}$ and $R_J(\alpha) =
\alpha D^{-1}\tilde{L} + \frac{1}{\alpha} D^{-1}\tilde{U}$. If $R_J(\alpha)$’s eigenvalues are independent of $\alpha$, then T is
called consistently ordered.

It is an easy fact that if T has Property A, such as the model problem,
then $PTP^T$ is consistently ordered for the permutation P that makes $PTP^T =
\begin{bmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{bmatrix}$ have diagonal $T_{11}$ and $T_{22}$. It is not true that consistent ordering
implies a matrix has Property A.


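EXAMPLE 6.9. Any block tridiagonal matrix

$$
\begin{bmatrix}
D_1 & A_1 & & \\
B_1 & \ddots & \ddots & \\
 & \ddots & \ddots & A_{n-1} \\
 & & B_{n-1} & D_n
\end{bmatrix}
$$

is consistently ordered when the $D_i$ are diagonal. ◇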


Consistent ordering implies that there are simple formulas relating the
eigenvalues of $R_J$, $R_{GS}$, and $R_{SOR(w)}$ [247].


THEOREM 6.6. If A is consistently ordered and $w \ne 0$, then the following are
true:


1) The eigenvalues of $R_J$ appear in $\pm$ pairs.
2) If $\mu$ is an eigenvalue of $R_J$ and
$$
(\lambda + w - 1)^2 = \lambda w^2 \mu^2, \tag{6.27}
$$
then $\lambda$ is an eigenvalue of $R_{SOR(w)}$.


3) Conversely, if $\lambda \ne 0$ is an eigenvalue of $R_{SOR(w)}$, then $\mu$ in equa-
tion (6.27) is an eigenvalue of $R_J$.


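Proof.


1) Consistent ordering implies that the eigenvalues of $R_J(\alpha)$ are indepen-
dent of $\alpha$, so $R_J = R_J(1)$ and $R_J(-1) = -R_J(1)$ have the same eigenvalues;
hence they appear in $\pm$ pairs.


2) If $\lambda = 0$ and equation (6.27) holds, then w = 1 and 0 is indeed an eigen-
value of $R_{SOR(1)} = R_{GS} = (I - L)^{-1}U$ since $R_{GS}$ is singular. Otherwise

$$
\begin{aligned}
0 &= \det(\lambda I - R_{SOR(w)}) \\
&= \det((I - wL)(\lambda I - R_{SOR(w)})) \\
&= \det((\lambda + w - 1)I - w\lambda L - wU) \\
&= \det\left( \sqrt{\lambda}\, w \left( \left( \frac{\lambda + w - 1}{\sqrt{\lambda}\, w} \right) I - \sqrt{\lambda}\, L - \frac{1}{\sqrt{\lambda}} U \right) \right) \\
&= \det\left( \left( \frac{\lambda + w - 1}{\sqrt{\lambda}\, w} \right) I - L - U \right) (\sqrt{\lambda}\, w)^n,
\end{aligned}
$$

where the last equality is true because of Proposition 6.2. Therefore
$\frac{\lambda + w - 1}{\sqrt{\lambda}\, w} = \mu$, an eigenvalue of $L + U = R_J$, and $(\lambda + w - 1)^2 = \mu^2 w^2 \lambda$.

3) If $\lambda \ne 0$, the last set of equalities works in the opposite direction.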


COROLLARY 6.1. If A is consistently ordered, then $\rho(R_{GS}) = (\rho(R_J))^2$. This
means that the Gauss-Seidel method is twice as fast as Jacobi’s method.


Proof. The choice w = 1 is equivalent to the Gauss-Seidel method, so
$\lambda^2 = \lambda \mu^2$ or $\lambda = \mu^2$.

To get the most benefit from overrelaxation, we would like to find $w_{opt}$
minimizing $\rho(R_{SOR(w)})$ [247].


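THEOREM 6.7. Suppose that A is consistently ordered, $R_J$ has real eigenval-
ues, and $\mu = \rho(R_J) < 1$. Then

$$
\begin{aligned}
w_{opt} &= \frac{2}{1 + \sqrt{1 - \mu^2}}, \\
\rho(R_{SOR(w_{opt})}) &= w_{opt} - 1 = \frac{\mu^2}{[1 + \sqrt{1 - \mu^2}]^2}, \\
\rho(R_{SOR(w)}) &= \begin{cases}
w - 1, & w_{opt} \le w < 2, \\
1 - w + \frac{1}{2} w^2 \mu^2 + w \mu \sqrt{1 - w + \frac{1}{4} w^2 \mu^2}, & 0 < w \le w_{opt}.
\end{cases}
\end{aligned}
$$

Proof. Solve $(\lambda + w - 1)^2 = \lambda w^2 \mu^2$ for $\lambda$.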


EXAMPLE 6.10. The model problem is an example: $R_J$ is symmetric, so it has
real eigenvalues. Figure 6.5 shows a plot of $\rho(R_{SOR(w)})$ versus w, along with
$\rho(R_{GS})$ and $\rho(R_J)$, for the model problem on an N-by-N grid with N = 16
and N = 64. The plots on the left are of $\rho(R)$, and the plots on the right
are semilogarithmic plots of $1 - \rho(R)$. The main conclusion that we can draw
is that the graph of $\rho(R_{SOR(w)})$ has a very narrow minimum, so if w is even
slightly different from $w_{opt}$, the convergence will slow down significantly. The
second conclusion is that if you have to guess $w_{opt}$, a large value (near 2) is a
better guess than a small value. ◇
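
The formulas of Theorem 6.7 are easy to evaluate numerically. The Python sketch below is illustrative only; the function names are hypothetical, and the value $\mu = \cos\frac{\pi}{N+1}$ with N = 16 is taken from the model problem. It evaluates $\rho(R_{SOR(w)})$ at a few values of w, including w = 1, where the result should reduce to $\mu^2$ as in Corollary 6.1.

```python
import math


def optimal_w(mu):
    """w_opt = 2 / (1 + sqrt(1 - mu^2)) from Theorem 6.7, for 0 <= mu < 1."""
    return 2.0 / (1.0 + math.sqrt(1.0 - mu * mu))


def sor_spectral_radius(w, mu):
    """rho(R_SOR(w)) as given by Theorem 6.7, for 0 < w < 2 and mu = rho(R_J) < 1."""
    if w >= optimal_w(mu):
        return w - 1.0
    # max(...) guards against a tiny negative value caused by rounding near w_opt.
    disc = max(0.0, 1.0 - w + 0.25 * w * w * mu * mu)
    return 1.0 - w + 0.5 * w * w * mu * mu + w * mu * math.sqrt(disc)


N = 16
mu = math.cos(math.pi / (N + 1))          # rho(R_J) for the model problem
w_opt = optimal_w(mu)
rho_gs = sor_spectral_radius(1.0, mu)     # w = 1: Gauss-Seidel
rho_opt = sor_spectral_radius(w_opt, mu)  # best possible SOR(w)
rho_large = sor_spectral_radius(1.9, mu)  # a guess above w_opt
```

For w above $w_{opt}$ the spectral radius is simply w − 1, which grows only linearly as w approaches 2; this is consistent with the remark that a guess near 2 is safer than a small one.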


6.5.6. Chebyshev Acceleration and Symmetric SOR (SSOR) 


Of the methods we have discussed so far, Jacobi’s and Gauss-Seidel methods 
require no information about the matrix to execute them (although proving 
that they converge requires some information). SOR(w) depends on a param- 
eter w, which can be chosen depending on p(R,z) to accelerate convergence. 
Chebyshev acceleration is useful when we know even more about the spectrum 
of Rj than just p(R,z) and lets us further accelerate convergence. 

Suppose that we convert Ax = b to the iteration x34, = Rz; +c using some 
method (Jacobi’s, Gauss-Seidel, or SOR(w)). Then we get a sequence {2;} 
where z; > x as i > oo if p(R) < 1. 
[Figure 6.5: four panels. Top row: 16 by 16 grid; bottom row: 64 by 64 grid. The left panels plot $\rho(R)$ and the right panels plot $1 - \rho(R)$ on a logarithmic scale, both against $\omega$ from 0 to 2, with curves for Jacobi, Gauss-Seidel, and SOR($\omega$).]

Fig. 6.5. Convergence of Jacobi’s, Gauss-Seidel, and SOR($\omega$) methods versus $\omega$ on
the model problem on a 16-by-16 grid and a 64-by-64 grid. The spectral radius $\rho(R)$
of each method ($\rho(R_J)$, $\rho(R_{GS})$, and $\rho(R_{SOR(\omega)})$) is plotted on the left, and $1 - \rho(R)$
on the right.

Given all these approximations $x_i$, it is natural to ask whether some linear
combination of them, $y_m = \sum_{i=0}^{m} \gamma_{mi} x_i$, is an even better approximation of
the solution $x$. Note that the scalars $\gamma_{mi}$ must satisfy $\sum_{i=0}^{m} \gamma_{mi} = 1$, since if
$x_0 = x_1 = \cdots = x$, we want $y_m = x$, too. So we can write the error in $y_m$ as

$$\begin{aligned}
y_m - x &= \sum_{i=0}^{m} \gamma_{mi} x_i - x \\
&= \sum_{i=0}^{m} \gamma_{mi} (x_i - x) \\
&= \sum_{i=0}^{m} \gamma_{mi} R^i (x_0 - x) \\
&= p_m(R)(x_0 - x), \qquad (6.28)
\end{aligned}$$

where $p_m(R) = \sum_{i=0}^{m} \gamma_{mi} R^i$ is a polynomial of degree $m$ with $p_m(1) = \sum_{i=0}^{m} \gamma_{mi} = 1$.


EXAMPLE 6.11. If we could choose $p_m$ to be the characteristic polynomial of
R, then $p_m(R) = 0$ by the Cayley-Hamilton theorem, and we would converge
in m steps. But this is not practical, because we seldom know the eigenvalues
of R and we want to converge much faster than in m = dim(R) steps anyway.
◊


Instead of seeking a polynomial such that $p_m(R)$ is zero, we will settle for
making the spectral radius of $p_m(R)$ as small as we can. Suppose that we knew


• the eigenvalues of R were real,

• the eigenvalues of R lay in an interval $[-\rho, \rho]$ not containing 1.

Then we could try to choose a polynomial $p_m$ where

1) $p_m(1) = 1$,
2) $\max_{-\rho \le x \le \rho} |p_m(x)|$ is as small as possible.


Since the eigenvalues of $p_m(R)$ are $p_m(\lambda(R))$ (see Problem 4.5), these eigenvalues
would be small and so the spectral radius (the largest eigenvalue in
absolute value) would be small.

Finding a polynomial $p_m$ to satisfy conditions 1) and 2) above is a classical
problem in approximation theory whose solution is based on Chebyshev
polynomials.


DEFINITION 6.15. The mth Chebyshev polynomial is defined by the recurrence
$T_m(x) = 2xT_{m-1}(x) - T_{m-2}(x)$, where $T_0(x) = 1$ and $T_1(x) = x$.
Chebyshev polynomials have many interesting properties [238]. Here are a
few, which are easy to prove from the definition (see Question 6.7).


LEMMA 6.7. Chebyshev polynomials have the following properties:

• $T_m(1) = 1$.

• $T_m(x) = 2^{m-1}x^m + O(x^{m-1})$.

• $T_m(x) = \begin{cases} \cos(m \cdot \arccos x) & \text{if } |x| \le 1, \\ \cosh(m \cdot \operatorname{arccosh} x) & \text{if } |x| \ge 1. \end{cases}$

• $|T_m(x)| \le 1$ if $|x| \le 1$.

• The zeros of $T_m(x)$ are $x_i = \cos((2i - 1)\pi/(2m))$ for $i = 1, \dots, m$.

• $T_m(x) = \frac{1}{2}[(x + \sqrt{x^2 - 1})^m + (x + \sqrt{x^2 - 1})^{-m}]$ if $|x| > 1$.

• $T_m(1 + \epsilon) \ge .5(1 + m\sqrt{2\epsilon})$ if $\epsilon > 0$.


Here is a table of values of $T_m(1 + \epsilon)$. Note how fast it grows as m grows,
even when $\epsilon$ is tiny (see Figure 6.6).


    m     |  eps = 10^-4    eps = 10^-3    eps = 10^-2
    ------+--------------------------------------------
    10    |  1.0            1.1            2.2
    100   |  2.2            44             6.9·10^5
    200   |  8.5            3.8·10^3       9.4·10^11
    1000  |  6.9·10^5       1.3·10^19      1.2·10^61
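
The growth of $T_m(1 + \epsilon)$ can be checked with a few lines of code built on the three-term recurrence of Definition 6.15. This is an illustrative sketch only; the function name `chebyshev_T` is hypothetical, and the closed form $\cosh(m \cdot \operatorname{arccosh} x)$ from Lemma 6.7 serves as an independent check.

```python
import math

def chebyshev_T(m, x):
    """Evaluate T_m(x) with the recurrence T_m = 2 x T_{m-1} - T_{m-2}."""
    t_prev, t_curr = 1.0, x                  # T_0(x), T_1(x)
    if m == 0:
        return t_prev
    for _ in range(2, m + 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

# Print T_m(1 + eps) for the values of m and eps used in the table.
for m in (10, 100, 200, 1000):
    row = [chebyshev_T(m, 1.0 + eps) for eps in (1e-4, 1e-3, 1e-2)]
    print(m, ["%.2g" % t for t in row])
```

Only the two most recent values are kept at any time, which is the same economy that the accelerated iteration exploits for vectors.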


A polynomial with the properties we want is $p_m(x) = T_m(x/\rho)/T_m(1/\rho)$.
To see why, note that $p_m(1) = 1$ and that if $x \in [-\rho, \rho]$, then $|p_m(x)| \le
1/T_m(1/\rho)$. For example, if $\rho = 1/(1 + \epsilon)$, then $|p_m(x)| \le 1/T_m(1 + \epsilon)$. As we
have just seen, this bound is tiny for small $\epsilon$ and modest m.

To implement this cheaply, we use the three-term recurrence $T_m(x) =
2xT_{m-1}(x) - T_{m-2}(x)$ used to define Chebyshev polynomials. This means
that we need only to save and combine three vectors $y_m$, $y_{m-1}$, and $y_{m-2}$,
not all the previous $x_m$. To see how this works, let $\mu_m \equiv 1/T_m(1/\rho)$, so
$p_m(R) = \mu_m T_m(R/\rho)$ and $\frac{1}{\mu_m} = \frac{2}{\rho\mu_{m-1}} - \frac{1}{\mu_{m-2}}$ by the three-term recurrence in
Definition 6.15. Then


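$$\begin{aligned}
y_m - x &= p_m(R)(x_0 - x) && \text{by equation (6.28)} \\
&= \mu_m T_m\!\left(\frac{R}{\rho}\right)(x_0 - x) \\
&= \mu_m \left[ 2\frac{R}{\rho} T_{m-1}\!\left(\frac{R}{\rho}\right)(x_0 - x) - T_{m-2}\!\left(\frac{R}{\rho}\right)(x_0 - x) \right] && \text{by Definition 6.15} \\
&= \mu_m \left[ 2\frac{R}{\rho} \frac{p_{m-1}(R)(x_0 - x)}{\mu_{m-1}} - \frac{p_{m-2}(R)(x_0 - x)}{\mu_{m-2}} \right] \\
&= \mu_m \left[ 2\frac{R}{\rho} \frac{y_{m-1} - x}{\mu_{m-1}} - \frac{y_{m-2} - x}{\mu_{m-2}} \right] && \text{by equation (6.28)}
\end{aligned}$$

or

$$y_m = \frac{2\mu_m}{\mu_{m-1}} \frac{R}{\rho} y_{m-1} - \frac{\mu_m}{\mu_{m-2}} y_{m-2} + d_m,$$

where

$$\begin{aligned}
d_m &= x - \frac{2\mu_m}{\mu_{m-1}} \frac{R}{\rho} x + \frac{\mu_m}{\mu_{m-2}} x \\
&= x - \frac{2\mu_m}{\rho\mu_{m-1}} (x - c) + \frac{\mu_m}{\mu_{m-2}} x && \text{since } x = Rx + c \\
&= \mu_m \left( \frac{1}{\mu_m} - \frac{2}{\rho\mu_{m-1}} + \frac{1}{\mu_{m-2}} \right) x + \frac{2\mu_m}{\rho\mu_{m-1}} c \\
&= \frac{2\mu_m}{\rho\mu_{m-1}} c && \text{by the definition of } \mu_m.
\end{aligned}$$


[Figure 6.6: panels plotting $T_m(x)$ for $x$ near $[-1, 1]$; the recoverable panel titles are T_3, T_5, T_10, and T_20.]

Fig. 6.6. Graph of $T_m(x)$ versus $x$. The dotted lines indicate that $|T_m(x)| \le 1$ for
$|x| \le 1$.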


This yields the algorithm. 


ALGORITHM 6.7. Chebyshev acceleration of $x_{i+1} = Rx_i + c$:

    $\mu_0 = 1$; $\mu_1 = \rho$; $y_0 = x_0$; $y_1 = Rx_0 + c$
    for m = 2, 3, ...
        $\mu_m = 1 \big/ \left( \frac{2}{\rho\mu_{m-1}} - \frac{1}{\mu_{m-2}} \right)$
        $y_m = \frac{2\mu_m}{\rho\mu_{m-1}} R y_{m-1} - \frac{\mu_m}{\mu_{m-2}} y_{m-2} + \frac{2\mu_m}{\rho\mu_{m-1}} c$
    end for
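
As an illustration, Algorithm 6.7 can be transcribed almost line for line. The sketch below is not from the text: the helper names are hypothetical, and the test iteration is an assumed one, namely Jacobi's method for the 10-by-10 matrix tridiag(-1, 2, -1), whose iteration matrix R is symmetric with eigenvalues $\cos(j\pi/11)$, so that $\rho = \cos(\pi/11)$.

```python
import math

def matvec(R, v):
    return [sum(r * x for r, x in zip(row, v)) for row in R]

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

def chebyshev_accelerate(R, c, x0, rho, steps):
    """Return y_steps from Algorithm 6.7 (steps >= 1)."""
    mu_old, mu = 1.0, rho                                # mu_0, mu_1
    y_old = list(x0)                                     # y_0
    y = [t + ci for t, ci in zip(matvec(R, x0), c)]      # y_1 = R x_0 + c
    for _ in range(2, steps + 1):
        mu_new = 1.0 / (2.0 / (rho * mu) - 1.0 / mu_old)
        a = 2.0 * mu_new / (rho * mu)
        b = mu_new / mu_old
        Ry = matvec(R, y)
        y_new = [a * t - b * yo + a * ci for t, yo, ci in zip(Ry, y_old, c)]
        y_old, y = y, y_new
        mu_old, mu = mu, mu_new
    return y

n, steps = 10, 30
R = [[0.5 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
x_true = [1.0] * n
c = [xi - ri for xi, ri in zip(x_true, matvec(R, x_true))]   # so x_true = R x_true + c
rho = math.cos(math.pi / (n + 1))
x0 = [0.0] * n

x = list(x0)                                  # plain iteration x_{i+1} = R x_i + c
for _ in range(steps):
    x = [t + ci for t, ci in zip(matvec(R, x), c)]
y = chebyshev_accelerate(R, c, x0, rho, steps)

err0 = norm2(x_true)                          # initial error, since x_0 = 0
err_plain = norm2([a - b for a, b in zip(x, x_true)])
err_cheb = norm2([a - b for a, b in zip(y, x_true)])
print(err_plain / err0, err_cheb / err0)
```

Both runs apply R the same number of times; the accelerated error is bounded by $\mu_m \|x_0 - x\|_2$ because R is symmetric here.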


Note that each iteration takes just one application of R, so if this is significantly
more expensive than the other scalar and vector operations, this algorithm
is no more expensive per step than the original iteration $x_{m+1} = Rx_m + c$.

Unfortunately, we cannot apply this directly to SOR($\omega$) for solving Ax = b,
because $R_{SOR(\omega)}$ generally has complex eigenvalues, and Chebyshev acceleration
requires that R have real eigenvalues in the interval $[-\rho, \rho]$. But we can
fix this by using the following algorithm.


ALGORITHM 6.8. SSOR:


1. Take one step of SOR($\omega$) computing the components of x in the usual
increasing order: $x_{i,1}, x_{i,2}, \dots, x_{i,n}$.


2. Take one step of SOR($\omega$) computing backwards: $x_{i,n}, x_{i,n-1}, \dots, x_{i,1}$.


We will reexpress this algorithm as $x_{i+1} = E_\omega x_i + c_\omega$ and show that $E_\omega$
has real eigenvalues, so we can use Chebyshev acceleration.
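
The two sweeps of Algorithm 6.8 can be sketched componentwise as follows. This is an assumed illustration rather than code from the text: the function names are hypothetical, the test matrix is the small one-dimensional analogue tridiag(-1, 2, -1) of the model problem, and $\omega = 1.3$ is an arbitrary choice in (0, 2).

```python
def sor_sweep(A, b, x, omega, order):
    """One SOR(omega) sweep, updating x in place in the given component order."""
    n = len(b)
    for i in order:
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (1.0 - omega) * x[i] + omega * (b[i] - s) / A[i][i]

def ssor_step(A, b, x, omega):
    """One SSOR step: a forward SOR sweep followed by a backward SOR sweep."""
    n = len(b)
    sor_sweep(A, b, x, omega, range(n))                 # components 1, 2, ..., n
    sor_sweep(A, b, x, omega, range(n - 1, -1, -1))     # components n, n-1, ..., 1
    return x

n = 5
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x_true = [1.0] * n
b = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]
x = [0.0] * n
for _ in range(200):
    ssor_step(A, b, x, 1.3)
err = max(abs(xi - ti) for xi, ti in zip(x, x_true))
print(err)
```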

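Suppose A is symmetric as in the model problem and again write $A =
D - \tilde{L} - \tilde{U} = D(I - L - U)$ as in equation (6.19). Since $A = A^T$, $U = L^T$. Use
equation (6.21) to rewrite the two steps of SSOR as

1. $x_{i+1/2} = (I - \omega L)^{-1}((1 - \omega)I + \omega U)x_i + c_{1/2} \equiv L_\omega x_i + c_{1/2}$,

2. $x_{i+1} = (I - \omega U)^{-1}((1 - \omega)I + \omega L)x_{i+1/2} + c_1 \equiv U_\omega x_{i+1/2} + c_1$.

Eliminating $x_{i+1/2}$ yields $x_{i+1} = E_\omega x_i + c_\omega$, where

$$\begin{aligned}
E_\omega &= U_\omega L_\omega \\
&= I + (\omega - 2)^2 (I - \omega U)^{-1}(I - \omega L)^{-1} + (\omega - 2)(I - \omega U)^{-1} \\
&\quad + (\omega - 2)(I - \omega U)^{-1}(I - \omega L)^{-1}(I - \omega U).
\end{aligned}$$

We claim that $E_\omega$ has real eigenvalues, since it has the same eigenvalues as the
similar matrix

$$\begin{aligned}
(I - \omega U)E_\omega(I - \omega U)^{-1}
&= I + (2 - \omega)^2 (I - \omega L)^{-1}(I - \omega U)^{-1} + (\omega - 2)(I - \omega U)^{-1} \\
&\quad + (\omega - 2)(I - \omega L)^{-1} \\
&= I + (2 - \omega)^2 (I - \omega L)^{-1}(I - \omega L)^{-T} + (\omega - 2)(I - \omega L)^{-T} \\
&\quad + (\omega - 2)(I - \omega L)^{-1},
\end{aligned}$$

which is clearly symmetric and so must have real eigenvalues.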
EXAMPLE 6.12. Let us apply SSOR($\omega$) with Chebyshev acceleration to the
model problem. We need to both choose $\omega$ and estimate the spectral radius $\rho =
\rho(E_\omega)$. The optimal $\omega$ that minimizes $\rho$ is not known, but Young [265, 135] has
2 


shown that the choice w = T+Pda(Ra) 2 is a good one, yielding p(£,,) ~ 1— 
< 


gn: With Chebyshev acceleration the error is multiplied by um ~ Tall tz) < 
2/(1 + m,/Ẹ) at step m. Therefore, to decrease the error by a fixed factor 
< 1 requires $m = O(N^{1/2}) = O(n^{1/4})$ iterations. Since each iteration has the
same cost as an iteration of SOR($\omega$), O(n), the overall cost is $O(n^{5/4})$. This
explains the entry for SSOR with Chebyshev acceleration in Table 6.1.

In contrast, after m steps of SOR(wopt), the error would decrease only by 
(1— +)™. For example, consider N = 1000. Then SOR(wop¢) requires m = 
220 iterations to cut the error in half, whereas SSOR (wopt) with Chebyshev 
acceleration requires only m = 17 iterations. © 


6.6. Krylov Subspace Methods 


These methods are used both to solve Ax = b and to find eigenvalues of A. 
They assume that A is accessible only via a “black-box” subroutine that returns
$y = Az$ given any z (and perhaps $y = A^Tz$ if A is nonsymmetric). In
other words, no direct access or manipulation of matrix entries is used. This
is a reasonable assumption for several reasons. First, the cheapest nontrivial 
operation that one can perform on a (sparse) matrix is to multiply it by a 
vector; if A has m nonzero entries, matrix-vector multiplication costs m mul- 
tiplications and (at most) m additions. Second, A may not be represented 
explicitly as a matrix but may be available only as a subroutine for computing 
Ax. 
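
To make the “black-box” idea concrete, here is a hedged sketch of a subroutine that returns y = Az without ever storing A. It assumes that A is the standard 5-point discretization of Poisson's equation on an N-by-N grid (diagonal entries 4, entries -1 for grid neighbors) with unknowns numbered row by row; the function name `poisson2d_matvec` is hypothetical.

```python
def poisson2d_matvec(z, N):
    """Return y = A z for the 5-point Laplacian on an N-by-N grid; A is never formed."""
    y = [0.0] * (N * N)
    for i in range(N):
        for j in range(N):
            k = i * N + j                 # index of grid point (i, j)
            s = 4.0 * z[k]
            if i > 0:
                s -= z[k - N]             # neighbor in the previous grid row
            if i < N - 1:
                s -= z[k + N]             # neighbor in the next grid row
            if j > 0:
                s -= z[k - 1]             # left neighbor
            if j < N - 1:
                s -= z[k + 1]             # right neighbor
            y[k] = s
    return y

print(poisson2d_matvec([1.0] * 9, 3))
```

Each call costs a number of operations proportional to the number of nonzero entries of A, which is the cost model described above.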


EXAMPLE 6.13. Suppose that we have a physical device whose behavior is 
modeled by a program, which takes a vector x of input parameters and pro- 
duces a vector y of output parameters describing the device’s behavior. The 
output y may be an arbitrarily complicated function y = f(x), perhaps re- 
quiring the solution of nonlinear differential equations. For example, x could 
be parameters describing the shape of a wing and f(x) could be the drag on 
the wing, computed by solving the Navier-Stokes equations for the airflow 
over the wing. A common engineering design problem is to pick the input x 
to optimize the device behavior f(x), where for concreteness we assume that 
this means making f(x) as small as possible. Our problem is then to try to 
solve f(x) = 0 as nearly as we can. Assume for illustration that x and y are 
vectors of equal dimension. Then Newton’s method is an obvious candidate, 
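yielding the iteration $x^{(m+1)} = x^{(m)} - (\nabla f(x^{(m)}))^{-1} f(x^{(m)})$, where $\nabla f(x^{(m)})$
is the Jacobian of f at $x^{(m)}$. We can rewrite this as solving the linear system
$(\nabla f(x^{(m)})) \cdot \delta^{(m)} = f(x^{(m)})$ for $\delta^{(m)}$ and then computing $x^{(m+1)} = x^{(m)} - \delta^{(m)}$.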
But how do we solve this linear system with coefficient matrix $\nabla f(x^{(m)})$ when
computing $f(x^{(m)})$ is already complicated? It turns out that we can compute
the matrix-vector product $(\nabla f(x)) \cdot z$ for an arbitrary vector z so that we can
use Krylov subspace methods to solve the linear system. One way to compute
$(\nabla f(x)) \cdot z$ is with divided differences or by using a Taylor expansion to
see that $[f(x + hz) - f(x)]/h \approx (\nabla f(x)) \cdot z$. Thus, computing $(\nabla f(x)) \cdot z$
requires two calls to the subroutine that computes $f(\cdot)$, once with argument
x and once with x + hz. However, sometimes it is difficult to choose h to
get an accurate approximation of the derivative (choosing h too small results
in a loss of accuracy due to roundoff). Another way to compute $(\nabla f(x)) \cdot z$
is to actually differentiate the function f. If f is simple enough, this can be
done by hand. For complicated f, compiler tools can take a (nearly) arbitrary
subroutine for computing f(x) and automatically produce another subroutine
for computing $(\nabla f(x)) \cdot z$ [29]. This can also be done by using the operator
overloading facilities of C++ or Fortran 90, although this is less efficient. ◊
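
The divided-difference approach of Example 6.13 can be sketched in a few lines. The function `jacobian_vector_product`, the default step h = 1e-6, and the small test function f below are all assumptions made for illustration; they do not come from the text.

```python
def jacobian_vector_product(f, x, z, h=1e-6):
    """Approximate (grad f(x)) . z by the divided difference [f(x + h z) - f(x)] / h."""
    fx = f(x)                                              # first call to f
    fxh = f([xi + h * zi for xi, zi in zip(x, z)])         # second call to f
    return [(a - b) / h for a, b in zip(fxh, fx)]

def f(x):
    # Hypothetical test function with Jacobian [[2*x0, -1], [x1, x0]].
    return [x[0] ** 2 - x[1], x[0] * x[1]]

approx = jacobian_vector_product(f, [1.0, 2.0], [3.0, 4.0])
exact = [2.0 * 1.0 * 3.0 - 4.0, 2.0 * 3.0 + 1.0 * 4.0]     # Jacobian at (1, 2) times z
print(approx, exact)
```

The truncation error is proportional to h and the roundoff error to 1/h, which is why the choice of h involves the tradeoff mentioned in the example.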


A variety of different Krylov subspace methods exist. Some are suitable for 
nonsymmetric matrices, and others assume symmetry or positive definiteness. 
Some methods for nonsymmetric matrices assume that $A^Tz$ can be computed
as well as $Az$; depending on how A is represented, $A^Tz$ may or may not be
available (see Example 6.13). The most efficient and best understood method,
the conjugate gradient method (CG), is suitable only for symmetric positive 
definite matrices, including the model problem. We will concentrate on CG in 
this chapter. 

Given a matrix that is not symmetric positive definite, it can be difficult 
to pick the best method from the many available. In section 6.6.6 we will 
give a short summary of the other methods available, besides CG, along with 
advice on which method to use in which situation. We also refer the reader to 
the more comprehensive on-line help at NETLIB/templates, which includes a 
book [24] and implementations in Matlab, Fortran, and C++. For a survey of 
current research in Krylov subspace methods, see [15, 105, 134, 212]. 

In Chapter 7, we will also discuss Krylov subspace methods for finding 
eigenvalues. 


6.6.1. Extracting Information about A via Matrix-Vector Multipli- 
cation 


Given a vector b and a subroutine for computing $A \cdot x$, what can we deduce
about A? The most obvious thing that we can do is compute the sequence of
matrix-vector products $y_1 = b$, $y_2 = Ay_1$, $y_3 = Ay_2 = A^2y_1$, ..., $y_n = Ay_{n-1} =
A^{n-1}y_1$, where A is n-by-n. Let $K = [y_1, y_2, \dots, y_n]$. Then we can write

$$A \cdot K = [Ay_1, \dots, Ay_{n-1}, Ay_n] = [y_2, \dots, y_n, A^n y_1]. \qquad (6.29)$$

Note that the leading n - 1 columns of $A \cdot K$ are the same as the trailing
n - 1 columns of K, shifted left by one. Assume for the moment that K is
nonsingular, so we can compute $c = -K^{-1}A^n y_1$. Then

$$A \cdot K = K \cdot [e_2, e_3, \dots, e_n, -c] \equiv K \cdot C,$$

where $e_i$ is the ith column of the identity matrix, or

$$K^{-1}AK = C = \begin{bmatrix}
0 & 0 & \cdots & 0 & -c_1 \\
1 & 0 & \cdots & 0 & -c_2 \\
0 & 1 & \ddots & \vdots & \vdots \\
\vdots & \ddots & \ddots & 0 & -c_{n-1} \\
0 & \cdots & 0 & 1 & -c_n
\end{bmatrix}.$$

Note that C is upper Hessenberg. In fact, it is a companion matrix (see section
4.5.3), which means that its characteristic polynomial is $p(x) = x^n +
\sum_{i=1}^{n} c_i x^{i-1}$. Thus, just by matrix-vector multiplication, we have reduced A
to a very simple form, and in principle we could now find the eigenvalues of A
by finding the zeros of p(x).

However, this simple form is not useful in practice, for the following reasons: 


1. Finding c requires n — 1 matrix-vector multiplications by A and then 
solving a linear system with K. Even if A is sparse, K is likely to be 
dense, so there is no reason to expect solving a linear system with K will 
be any easier than solving the original problem Ax = b.


2. K is likely to be very ill-conditioned, so c would be very inaccurately 
computed. This is because the algorithm is performing the power method 
(Algorithm 4.1) to get the columns y; of K, so that y; is converging to 
an eigenvector corresponding to the largest eigenvalue of A. Thus, the 
columns of K tend to get more and more parallel. 
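
The second difficulty is easy to observe numerically. The sketch below, which is an assumed illustration and not from the text, builds the columns $y_j$ of K for the hypothetical test matrix A = diag(1, 2, 3, 4) with b = [1, 1, 1, 1] and prints the cosine of the angle between consecutive columns.

```python
import math

def cosine(u, v):
    """Cosine of the angle between the vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

diag = [1.0, 2.0, 3.0, 4.0]          # A = diag(1, 2, 3, 4), an assumed test matrix
y = [1.0, 1.0, 1.0, 1.0]             # y_1 = b
columns = [y]
for _ in range(8):
    y = [d * t for d, t in zip(diag, y)]     # y_{j+1} = A y_j, one power-method step
    columns.append(y)

cosines = [cosine(columns[j], columns[j + 1]) for j in range(8)]
print(["%.5f" % c for c in cosines])
```

The later columns are dominated by the component along the eigenvector for the largest eigenvalue 4, so the cosines approach 1 and K becomes nearly rank deficient.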


We will overcome these problems as follows: We will replace K with an
orthogonal matrix Q such that for all k, the leading k columns of K and Q
span the same space. This space is called a Krylov subspace. In
contrast to K, Q is well conditioned and easy to invert. Furthermore, we will
compute only as many leading columns of Q as needed to get an accurate
solution (for $Ax = b$ or $Ax = \lambda x$). In practice we usually need very few
columns compared to the matrix dimension n.

We proceed by writing K = QR, the QR decomposition of K. Then

$$K^{-1}AK = (R^{-1}Q^T)A(QR) = C,$$

implying

$$Q^TAQ = RCR^{-1} \equiv H.$$

Since R and $R^{-1}$ are both upper triangular and C is upper Hessenberg, it is
easy to confirm that $H = RCR^{-1}$ is also upper Hessenberg (see Question 6.11).
In other words, we have reduced A to upper Hessenberg form by an orthogonal
transformation Q. (This is the first step of the algorithm for finding eigenvalues
of nonsymmetric matrices discussed in section 4.4.6.) Note that if A is
symmetric, so is $Q^TAQ = H$, and a symmetric matrix which is upper Hessenberg
must also be lower Hessenberg, i.e., tridiagonal. In this case we write
$Q^TAQ = T$.

We still need to show how to compute the columns of Q one at a time,
rather than all of them: Let $Q = [q_1, \dots, q_n]$. Since $Q^TAQ = H$ implies
$AQ = QH$, we can equate column j on both sides of $AQ = QH$, yielding

$$Aq_j = \sum_{i=1}^{j+1} h_{i,j} q_i.$$

Since the $q_i$ are orthonormal, we can multiply both sides of this last equality
by $q_m^T$ to get

$$q_m^T A q_j = \sum_{i=1}^{j+1} h_{i,j} q_m^T q_i = h_{m,j} \quad \text{for } 1 \le m \le j$$

and so

$$h_{j+1,j} q_{j+1} = Aq_j - \sum_{i=1}^{j} h_{i,j} q_i.$$

This justifies the following algorithm.


ALGORITHM 6.9. The Arnoldi algorithm for (partial) reduction to Hessenberg
form:

    $q_1 = b/\|b\|_2$
    /* k is the number of columns of Q and H to compute */
    for j = 1 to k
        $z = Aq_j$
        for i = 1 to j
            $h_{i,j} = q_i^T z$
            $z = z - h_{i,j} q_i$
        end for
        $h_{j+1,j} = \|z\|_2$
        if $h_{j+1,j} = 0$, quit
        $q_{j+1} = z/h_{j+1,j}$
    end for
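
For illustration, the Arnoldi algorithm can be written as a short routine on plain lists. This is a minimal sketch under stated assumptions: the names `arnoldi`, `matvec`, and `dot` are hypothetical, H is returned as a dictionary keyed by the 1-based index pair (i, j), and the 4-by-4 nonsymmetric test matrix is made up for the example.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def arnoldi(A, b, k):
    """Algorithm 6.9: returns Q = [q_1, ..., q_{k+1}] and H = {(i, j): h_ij}."""
    nb = math.sqrt(dot(b, b))
    Q = [[t / nb for t in b]]                       # q_1 = b / ||b||_2
    H = {}
    for j in range(1, k + 1):
        z = matvec(A, Q[j - 1])
        for i in range(1, j + 1):                   # modified Gram-Schmidt loop
            H[i, j] = dot(Q[i - 1], z)
            z = [zt - H[i, j] * qt for zt, qt in zip(z, Q[i - 1])]
        H[j + 1, j] = math.sqrt(dot(z, z))
        if H[j + 1, j] == 0.0:                      # Krylov subspace is invariant: quit
            break
        Q.append([zt / H[j + 1, j] for zt in z])
    return Q, H

A = [[2.0, 1.0, 0.0, 0.0],
     [0.0, 3.0, 1.0, 0.0],
     [0.0, 0.0, 4.0, 1.0],
     [1.0, 0.0, 0.0, 5.0]]
Q, H = arnoldi(A, [1.0, 0.0, 0.0, 0.0], 3)

# Check A q_j = sum_{i <= j+1} h_ij q_i for each computed column j.
residuals = []
for j in range(1, 4):
    Aq = matvec(A, Q[j - 1])
    recon = [sum(H[i, j] * Q[i - 1][t] for i in range(1, j + 2)) for t in range(4)]
    residuals.append(max(abs(a - r) for a, r in zip(Aq, recon)))
print(residuals)
```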
The $q_j$ computed by Arnoldi’s algorithm are often called Arnoldi vectors.
The loop over i updating z can also be described as applying the modified
Gram-Schmidt algorithm (Algorithm 3.1) to subtract the components in the
directions $q_1$ through $q_j$ away from z, leaving z orthogonal to them. Computing
$q_1$ through $q_k$ costs k matrix-vector multiplications by A, plus $O(k^2n)$ other
work. If we stop the algorithm here, what have we learned about A? Let us
write $Q = [Q_k, Q_u]$, where $Q_k = [q_1, \dots, q_k]$ and $Q_u = [q_{k+1}, \dots, q_n]$. Note that
we have computed only $Q_k$ and $q_{k+1}$; the other columns of $Q_u$ are unknown.
Then

$$H = Q^TAQ = [Q_k, Q_u]^T A [Q_k, Q_u]
= \begin{bmatrix} Q_k^TAQ_k & Q_k^TAQ_u \\ Q_u^TAQ_k & Q_u^TAQ_u \end{bmatrix}
\equiv \begin{bmatrix} H_k & H_{uk} \\ H_{ku} & H_u \end{bmatrix}, \qquad (6.30)$$

where the block rows and block columns have dimensions k and n - k, respectively.

Note that $H_k$ is upper Hessenberg, because H has the same property. For
the same reason, $H_{ku}$ has a single (possibly) nonzero entry in its upper right
corner, namely, $h_{k+1,k}$. Thus, $H_u$ and $H_{uk}$ are unknown; we know only $H_k$
and $H_{ku}$.

When A is symmetric, H = T is symmetric and tridiagonal, and the Arnoldi
algorithm simplifies considerably, because most of the $h_{i,j}$ are zero: Write

$$T = \begin{bmatrix}
\alpha_1 & \beta_1 & & \\
\beta_1 & \ddots & \ddots & \\
& \ddots & \ddots & \beta_{n-1} \\
& & \beta_{n-1} & \alpha_n
\end{bmatrix}.$$

Equating column j on both sides of AQ = QT yields

$$Aq_j = \beta_{j-1} q_{j-1} + \alpha_j q_j + \beta_j q_{j+1}.$$

Since the columns of Q are orthonormal, multiplying both sides of this equation
by $q_j^T$ yields $q_j^T A q_j = \alpha_j$. This justifies the following version of the Arnoldi
algorithm, called the Lanczos algorithm.


ALGORITHM 6.10. The Lanczos algorithm for (partial) reduction to symmetric
tridiagonal form.

    $q_1 = b/\|b\|_2$; $\beta_0 = 0$; $q_0 = 0$
    for j = 1 to k
        $z = Aq_j$
        $\alpha_j = q_j^T z$
        $z = z - \alpha_j q_j - \beta_{j-1} q_{j-1}$
        $\beta_j = \|z\|_2$
        if $\beta_j = 0$, quit
        $q_{j+1} = z/\beta_j$
    end for
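
A direct transcription of the Lanczos algorithm might look as follows. It is a sketch under assumptions: the names are hypothetical, the symmetric 4-by-4 test matrix is made up, and no reorthogonalization is attempted, so the check of $Q_k^TAQ_k = T_k$ is meaningful only for a small number of steps such as the three taken here.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lanczos(A, b, k):
    """Algorithm 6.10: returns the Lanczos vectors Q and the lists alpha, beta."""
    n = len(b)
    nb = math.sqrt(dot(b, b))
    q_prev = [0.0] * n                  # q_0 = 0
    q = [t / nb for t in b]             # q_1 = b / ||b||_2
    beta_prev = 0.0                     # beta_0 = 0
    Q, alpha, beta = [q], [], []
    for _ in range(k):
        z = matvec(A, q)
        a = dot(q, z)
        z = [zt - a * qt - beta_prev * pt for zt, qt, pt in zip(z, q, q_prev)]
        bj = math.sqrt(dot(z, z))
        alpha.append(a)
        beta.append(bj)
        if bj == 0.0:                   # invariant subspace found: quit
            break
        q_prev, q, beta_prev = q, [zt / bj for zt in z], bj
        Q.append(q)
    return Q, alpha, beta

A = [[4.0, 1.0, 0.0, 2.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [2.0, 0.0, 1.0, 5.0]]
Q, alpha, beta = lanczos(A, [1.0, 1.0, 1.0, 1.0], 3)

# Compare q_i^T A q_j with the tridiagonal entries predicted by alpha and beta.
k = len(alpha)
worst = 0.0
for i in range(k):
    for j in range(k):
        t_ij = dot(Q[i], matvec(A, Q[j]))
        if i == j:
            expected = alpha[i]
        elif abs(i - j) == 1:
            expected = beta[min(i, j)]
        else:
            expected = 0.0
        worst = max(worst, abs(t_ij - expected))
print(alpha, beta, worst)
```

Only the two most recent Lanczos vectors enter each step; the list Q is kept here solely so that the result can be checked.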


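The $q_j$ computed by the Lanczos algorithm are often called Lanczos vectors.
After k steps of Lanczos, here is what we have learned about A:

$$\begin{aligned}
T = Q^TAQ &= [Q_k, Q_u]^T A [Q_k, Q_u] \\
&= \begin{bmatrix} Q_k^TAQ_k & Q_k^TAQ_u \\ Q_u^TAQ_k & Q_u^TAQ_u \end{bmatrix}
\equiv \begin{bmatrix} T_k & T_{uk} \\ T_{ku} & T_u \end{bmatrix}
= \begin{bmatrix} T_k & T_{ku}^T \\ T_{ku} & T_u \end{bmatrix}, \qquad (6.31)
\end{aligned}$$

where the block rows and block columns have dimensions k and n - k, respectively.

Because A is symmetric, we know $T_k$ and $T_{ku} = T_{uk}^T$, but not $T_u$. $T_{ku}$ has a
single (possibly) nonzero entry in its upper right corner, namely, $\beta_k$. Note that
$\beta_k$ is nonnegative, because it is computed as the norm of z.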

We define some standard notation associated with the partial factorization 
of A computed by the Arnoldi and Lanczos algorithms. 


DEFINITION 6.16. The Krylov subspace $\mathcal{K}_k(A, b)$ is span$[b, Ab, A^2b, \dots, A^{k-1}b]$.


We will write $\mathcal{K}_k$ instead of $\mathcal{K}_k(A, b)$ if A and b are implicit from the context.
Provided that the algorithm does not quit because z = 0, the vectors $Q_k$
computed by the Arnoldi or Lanczos algorithms form an orthonormal basis of
the Krylov subspace $\mathcal{K}_k$. (One can show that $\mathcal{K}_k$ has dimension k if and only
if the Arnoldi or Lanczos algorithm can compute $q_k$ without quitting first; see
Question 6.12.) We also call $H_k$ (or $T_k$) the projection of A onto the Krylov
subspace $\mathcal{K}_k$.

Our goal is to design algorithms to solve Ax = b using only the information 
computed by k steps of the Arnoldi or Lanczos algorithm. We hope that k can 
be much smaller than n, so the algorithms are efficient. 

(In Chapter 7 we will use this same information to find eigenvalues of A.
We can already sketch how we will do this: Note that if $h_{k+1,k}$ happens to be
zero, then H (or T) is block upper triangular and so all the eigenvalues of $H_k$
are also eigenvalues of H, and therefore also of A, since A and H are similar.
The (right) eigenvectors of $H_k$ are eigenvectors of H, and if we multiply them
by $Q_k$, we get eigenvectors of A. When $h_{k+1,k}$ is nonzero but small, we expect
the eigenvalues and eigenvectors of $H_k$ to provide good approximations to the
eigenvalues and eigenvectors of A.)

We finish this introduction by noting that roundoff error causes a num- 
ber of the algorithms that we discuss to behave entirely differently from how 
they would in exact arithmetic. In particular, the vectors $q_j$ computed by
the Lanczos algorithm can quickly lose orthogonality and in fact often become
linearly dependent. This apparently disastrous numerical instability led
researchers to abandon these algorithms for several years after their discov- 
ery. But eventually researchers learned either how to stabilize the algorithms 
or that convergence occurred despite instability! We return to these points 
in section 6.6.4, where we analyze the convergence of the conjugate gradient 
method for solving Ax = b (which is “unstable” but converges anyway), and in 
Chapter 7, especially in sections 7.4 and 7.5, where we show how to compute 
eigenvalues (and the basic algorithm is modified to ensure stability). 


6.6.2. Solving Ax = b Using the Krylov Subspace $\mathcal{K}_k$


How do we solve Ax = b, given only the information available from k steps of 
either the Arnoldi or the Lanczos algorithm? 

Since the only vectors we know are the columns of $Q_k$, the only place to
“look” for an approximate solution is in the Krylov subspace $\mathcal{K}_k$ spanned by
these vectors. In other words, we seek the “best” approximate solution of the
form

$$x_k = \sum_{j=1}^{k} z_j q_j = Q_k \cdot z, \quad \text{where } z = [z_1, \dots, z_k]^T.$$

Now we have to define “best.” There are several natural but different
definitions, leading to different algorithms. We let $x = A^{-1}b$ denote the true
solution and $r_k = b - Ax_k$ denote the residual.


1. The “best” $x_k$ minimizes $\|x_k - x\|_2$. Unfortunately, we do not have
enough information in our Krylov subspace to compute this $x_k$.


2. The “best” $x_k$ minimizes $\|r_k\|_2$. This is implementable, and the corresponding
algorithms are called MINRES (for minimum residual) when
A is symmetric [192] and GMRES (for generalized minimum residual)
when A is nonsymmetric [213].


3. The “best” $x_k$ makes $r_k \perp \mathcal{K}_k$, i.e., $Q_k^T r_k = 0$. This is sometimes called
the orthogonal residual property, or a Galerkin condition, by analogy to
a similar condition in the theory of finite elements. When A is symmetric,
the corresponding algorithm is called SYMMLQ [192]. When A is
nonsymmetric, a variation of GMRES works [209].


4. When A is symmetric and positive definite, it defines a norm $\|r\|_{A^{-1}} =
(r^T A^{-1} r)^{1/2}$ (see Lemma 1.3). We say the “best” $x_k$ minimizes $\|r_k\|_{A^{-1}}$.
This norm is the same as $\|x_k - x\|_A$. The algorithm is called the conjugate
gradient algorithm [143].


When A is symmetric positive definite, the last two definitions of “best”
also turn out to be equivalent.




THEOREM 6.8. Let A be symmetric, $T_k = Q_k^T A Q_k$, and $r_k = b - Ax_k$, where
$x_k \in \mathcal{K}_k$. If $T_k$ is nonsingular and $x_k = Q_k T_k^{-1} e_1 \|b\|_2$, where $e_1 = [1, 0, \dots, 0]^T$ has dimension k,
then $Q_k^T r_k = 0$. If A is also positive definite, then $T_k$ must be nonsingular, and
this choice of $x_k$ also minimizes $\|r_k\|_{A^{-1}}$ over all $x_k \in \mathcal{K}_k$. We also have that

$$r_k = \pm\|r_k\|_2 \, q_{k+1}.$$


Proof. We drop the subscripts k for ease of notation. Let $x = QT^{-1}e_1\|b\|_2$
and $r = b - Ax$, and assume that $T = Q^TAQ$ is nonsingular. We confirm that
$Q^Tr = 0$ by computing

$$\begin{aligned}
Q^Tr = Q^T(b - Ax) &= Q^Tb - Q^TAx \\
&= e_1\|b\|_2 - Q^TA(QT^{-1}e_1\|b\|_2) && \text{because the first column of } Q \text{ is } b/\|b\|_2 \\
& && \text{and its other columns are orthogonal to } b \\
&= e_1\|b\|_2 - (Q^TAQ)T^{-1}e_1\|b\|_2 \\
&= e_1\|b\|_2 - (T)T^{-1}e_1\|b\|_2 && \text{because } Q^TAQ = T \\
&= 0.
\end{aligned}$$

Now assume that A is also positive definite. Then T must be positive
definite and thus nonsingular too (see Question 6.13). Let $\hat{x} = x + Qz$ be
another candidate solution in $\mathcal{K}_k$ and let $\hat{r} = b - A\hat{x}$. We need to show that
$\|\hat{r}\|_{A^{-1}}$ is minimized when z = 0. But

$$\begin{aligned}
\|\hat{r}\|_{A^{-1}}^2 &= \hat{r}^T A^{-1} \hat{r} && \text{by definition} \\
&= (r - AQz)^T A^{-1} (r - AQz) && \text{since } \hat{r} = b - A\hat{x} = b - A(x + Qz) = r - AQz \\
&= r^T A^{-1} r - 2(AQz)^T A^{-1} r + (AQz)^T A^{-1} (AQz) \\
&= \|r\|_{A^{-1}}^2 - 2z^T Q^T r + \|AQz\|_{A^{-1}}^2 && \text{since } (AQz)^T A^{-1} r = z^T Q^T A A^{-1} r = z^T Q^T r \\
&= \|r\|_{A^{-1}}^2 + \|AQz\|_{A^{-1}}^2 && \text{since } Q^T r = 0,
\end{aligned}$$

so $\|\hat{r}\|_{A^{-1}}$ is minimized if and only if AQz = 0. But AQz = 0 if and only if
z = 0 since A is nonsingular and Q has full column rank.

To show that $r_k = \pm\|r_k\|_2 q_{k+1}$, we reintroduce subscripts. Since $x_k \in \mathcal{K}_k$,
we must have $r_k = b - Ax_k \in \mathcal{K}_{k+1}$, so $r_k$ is a linear combination of the columns
of $Q_{k+1}$, since these columns span $\mathcal{K}_{k+1}$. But since $Q_k^T r_k = 0$, the only column
of $Q_{k+1}$ to which $r_k$ is not orthogonal is $q_{k+1}$.


6.6.3. Conjugate Gradient Method 


The algorithm of choice for symmetric positive definite matrices is CG. Theorem
6.8 characterizes the solution $x_k$ computed by CG. While MINRES might
seem more natural than CG because it minimizes $\|r_k\|_2$ instead of $\|r_k\|_{A^{-1}}$, it
turns out that MINRES requires more work to implement, is more susceptible
to numerical instabilities, and thus often produces less accurate answers
than CG. We will see that CG has the particularly attractive property that
it can be implemented by keeping only four vectors in memory at one time,
and not k ($q_1$ through $q_k$). Furthermore, the work in the inner loop, beyond
the matrix-vector product, is limited to two dot products, three “saxpy” operations
(adding a multiple of one vector to another), and a handful of scalar
operations. This is a very small amount of work and storage.

Now we derive CG. There are several ways to do this. We will start with
the Lanczos algorithm (Algorithm 6.10), which computes the columns of the
orthogonal matrix $Q_k$ and the entries of the tridiagonal matrix $T_k$, along with
the formula $x_k = Q_k T_k^{-1} e_1 \|b\|_2$ from Theorem 6.8. We will show how to
compute $x_k$ directly via recurrences for three sets of vectors. We will keep only
the most recent vector from each set in memory at one time, overwriting the
old ones. The first set of vectors are the approximate solutions $x_k$. The second
set of vectors are the residuals $r_k = b - Ax_k$, which Theorem 6.8 showed were
parallel to the Lanczos vectors $q_{k+1}$. The third set of vectors are the conjugate
gradients $p_k$. The $p_k$ are called gradients because a single step of CG can be
interpreted as choosing a scalar $\nu$ so that the new solution $x_k = x_{k-1} + \nu p_k$
minimizes the residual norm $\|r_k\|_{A^{-1}} = (r_k^T A^{-1} r_k)^{1/2}$. In other words, the $p_k$
are used as gradient search directions. The $p_k$ are called conjugate, or more
precisely A-conjugate, because $p_k^T A p_j = 0$ if $j \ne k$. In other words, the $p_k$ are
orthogonal with respect to the inner product defined by A (see Lemma 1.3).

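Since A is symmetric positive definite, so is $T_k = Q_k^T A Q_k$ (see Question
6.13). This means we can perform Cholesky on $T_k$ to get $T_k = \hat{L}_k \hat{L}_k^T =
L_k D_k L_k^T$, where $L_k$ is unit lower bidiagonal and $D_k$ is diagonal. Then using
the formula for $x_k$ from Theorem 6.8, we get

$$\begin{aligned}
x_k &= Q_k T_k^{-1} e_1 \|b\|_2 \\
&= Q_k (L_k^{-T} D_k^{-1} L_k^{-1}) e_1 \|b\|_2 \\
&= (Q_k L_k^{-T})(D_k^{-1} L_k^{-1} e_1 \|b\|_2) \\
&\equiv (\tilde{P}_k)(y_k),
\end{aligned}$$

where $\tilde{P}_k \equiv Q_k L_k^{-T}$ and $y_k \equiv D_k^{-1} L_k^{-1} e_1 \|b\|_2$. Write $\tilde{P}_k = [\tilde{p}_1, \dots, \tilde{p}_k]$. The
conjugate gradients $p_i$ will turn out to be parallel to the columns $\tilde{p}_i$ of $\tilde{P}_k$. We
know enough to prove the following lemma.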


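LEMMA 6.8. The columns $\tilde{p}_i$ of $\tilde{P}_k$ are A-conjugate. In other words, $\tilde{P}_k^T A \tilde{P}_k$
is diagonal.


Proof. We compute

$$\tilde{P}_k^T A \tilde{P}_k = (Q_k L_k^{-T})^T A (Q_k L_k^{-T}) = L_k^{-1} (Q_k^T A Q_k) L_k^{-T} = L_k^{-1} (T_k) L_k^{-T}
= L_k^{-1} (L_k D_k L_k^T) L_k^{-T} = D_k.$$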
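Now we derive simple recurrences for the columns of $\tilde{P}_k$ and entries of
$y_k$. We will show that $y_{k-1} = [\eta_1, \dots, \eta_{k-1}]^T$ is identical to the leading k - 1
entries of $y_k = [\eta_1, \dots, \eta_{k-1}, \eta_k]^T$ and that $\tilde{P}_{k-1}$ is identical to the leading k - 1
columns of $\tilde{P}_k$. Therefore we can let

$$x_k = \tilde{P}_k y_k = [\tilde{P}_{k-1}, \tilde{p}_k] \cdot \begin{bmatrix} y_{k-1} \\ \eta_k \end{bmatrix} = \tilde{P}_{k-1} y_{k-1} + \tilde{p}_k \eta_k = x_{k-1} + \tilde{p}_k \eta_k \qquad (6.32)$$

be our recurrence for $x_k$.

The recurrence for the $\eta_k$ is derived as follows. Since $T_{k-1}$ is the leading
(k - 1)-by-(k - 1) submatrix of $T_k$, $L_{k-1}$ and $D_{k-1}$ are also the leading (k - 1)-by-(k - 1)
submatrices of $L_k$ and $D_k$, respectively:

$$\begin{aligned}
T_k &= \begin{bmatrix}
\alpha_1 & \beta_1 & & \\
\beta_1 & \ddots & \ddots & \\
& \ddots & \alpha_{k-1} & \beta_{k-1} \\
& & \beta_{k-1} & \alpha_k
\end{bmatrix} = L_k D_k L_k^T \\
&= \begin{bmatrix}
1 & & & \\
l_1 & \ddots & & \\
& \ddots & 1 & \\
& & l_{k-1} & 1
\end{bmatrix} \cdot \mathrm{diag}(d_1, \dots, d_{k-1}, d_k) \cdot \begin{bmatrix}
1 & l_1 & & \\
& \ddots & \ddots & \\
& & 1 & l_{k-1} \\
& & & 1
\end{bmatrix} \\
&= \begin{bmatrix} L_{k-1} & \\ l_{k-1}\hat{e}_{k-1}^T & 1 \end{bmatrix} \cdot \mathrm{diag}(D_{k-1}, d_k) \cdot \begin{bmatrix} L_{k-1} & \\ l_{k-1}\hat{e}_{k-1}^T & 1 \end{bmatrix}^T,
\end{aligned}$$

where $\hat{e}_{k-1}^T = [0, \dots, 0, 1]$ has dimension k - 1. Similarly, $D_{k-1}^{-1}$ and $L_{k-1}^{-1}$ are
also the leading (k - 1)-by-(k - 1) submatrices of $D_k^{-1} = \mathrm{diag}(D_{k-1}^{-1}, d_k^{-1})$ and

$$L_k^{-1} = \begin{bmatrix} L_{k-1}^{-1} & \\ * & 1 \end{bmatrix},$$

respectively, where the details of the last row $*$ do not concern us. This means
that $y_{k-1} = D_{k-1}^{-1} L_{k-1}^{-1} \hat{e}_1 \|b\|_2$, where $\hat{e}_1$ has dimension k - 1, is identical to the
leading k - 1 components of

$$y_k = D_k^{-1} L_k^{-1} e_1 \|b\|_2 = \begin{bmatrix} D_{k-1}^{-1} & \\ & d_k^{-1} \end{bmatrix} \cdot \begin{bmatrix} L_{k-1}^{-1} & \\ * & 1 \end{bmatrix} \cdot e_1 \|b\|_2
= \begin{bmatrix} D_{k-1}^{-1} L_{k-1}^{-1} \hat{e}_1 \|b\|_2 \\ \eta_k \end{bmatrix} = \begin{bmatrix} y_{k-1} \\ \eta_k \end{bmatrix}.$$

Now we need a recurrence for the columns of $\tilde{P}_k = [\tilde{p}_1, \dots, \tilde{p}_k]$. Since $L_{k-1}^T$
is upper triangular, so is $L_{k-1}^{-T}$, and it forms the leading (k - 1)-by-(k - 1)
submatrix of $L_k^{-T}$. Therefore $\tilde{P}_{k-1}$ is identical to the leading k - 1 columns of

$$\tilde{P}_k = Q_k L_k^{-T} = [Q_{k-1}, q_k] \begin{bmatrix} L_{k-1}^{-T} & * \\ 0 & 1 \end{bmatrix} = [Q_{k-1} L_{k-1}^{-T}, \tilde{p}_k] = [\tilde{P}_{k-1}, \tilde{p}_k].$$

From $\tilde{P}_k = Q_k L_k^{-T}$ we get $\tilde{P}_k L_k^T = Q_k$ or, equating the kth column on both
sides, the recurrence

$$\tilde{p}_k = q_k - l_{k-1} \tilde{p}_{k-1}. \qquad (6.33)$$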


Altogether, we have recursions for $q_k$ (from the Lanczos algorithm), for
$\tilde{p}_k$ (from equation (6.33)), and for the approximate solution $x_k$ (from equation
(6.32)). All these recursions are short; i.e., they require only the previous
iterate or two to implement. Thus, they together provide the means to compute
$x_k$ while storing a small number of vectors and doing a small number of
dot products, saxpys, and scalar work in the inner loop.

We still have to simplify these recursions slightly to get the ultimate CG
algorithm. Since Theorem 6.8 tells us that $r_k$ and $q_{k+1}$ are parallel, we can
replace the Lanczos recurrence for $q_{k+1}$ with the recurrence $r_k = b - Ax_k$
or equivalently $r_k = r_{k-1} - \eta_k A\tilde{p}_k$ (obtained by multiplying the recurrence
$x_k = x_{k-1} + \eta_k \tilde{p}_k$ by A and subtracting from b = b). This yields the three
vector recurrences

$$r_k = r_{k-1} - \eta_k A \tilde{p}_k, \qquad (6.34)$$
$$x_k = x_{k-1} + \eta_k \tilde{p}_k \quad \text{from equation (6.32)}, \qquad (6.35)$$
$$\tilde{p}_k = q_k - l_{k-1} \tilde{p}_{k-1} \quad \text{from equation (6.33)}. \qquad (6.36)$$

In order to eliminate $q_k$, substitute $q_k = r_{k-1}/\|r_{k-1}\|_2$ and $p_k \equiv \|r_{k-1}\|_2 \tilde{p}_k$
into the above recurrences to get

$$r_k = r_{k-1} - \frac{\eta_k}{\|r_{k-1}\|_2} A p_k \equiv r_{k-1} - \nu_k A p_k, \qquad (6.37)$$
$$x_k = x_{k-1} + \frac{\eta_k}{\|r_{k-1}\|_2} p_k \equiv x_{k-1} + \nu_k p_k, \qquad (6.38)$$
$$p_k = r_{k-1} - \frac{\|r_{k-1}\|_2 \, l_{k-1}}{\|r_{k-2}\|_2} p_{k-1} \equiv r_{k-1} + \mu_k p_{k-1}. \qquad (6.39)$$


We still need formulas for the scalars $\nu_k$ and $\mu_k$. As we will see, there are
several equivalent mathematical expressions for them in terms of dot products
of vectors computed by the algorithm. Our ultimate formulas are chosen to
minimize the number of dot products needed and because they are more stable
than the alternatives.
To get a formula for $\nu_k$, first we multiply both sides of equation (6.39) on
the left by $p_k^T A$, and use the fact that $p_k$ and $p_{k-1}$ are A-conjugate (Lemma 6.8)
to get

$$p_k^T A p_k = p_k^T A r_{k-1} + 0 = r_{k-1}^T A p_k. \qquad (6.40)$$

Then, multiply both sides of equation (6.37) on the left by $r_{k-1}^T$ and use the
fact that $r_{k-1}^T r_k = 0$ (since the $r_i$ are parallel to the columns of the orthogonal
matrix Q) to get

$$\nu_k = \frac{r_{k-1}^T r_{k-1}}{r_{k-1}^T A p_k} = \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k} \quad \text{by equation (6.40)}. \qquad (6.41)$$

(Equation (6.41) can also be derived from a property of $\nu_k$ in Theorem 6.8,
namely, that it minimizes the residual norm

$$\begin{aligned}
\|r_k\|_{A^{-1}}^2 &= r_k^T A^{-1} r_k \\
&= (r_{k-1} - \nu_k A p_k)^T A^{-1} (r_{k-1} - \nu_k A p_k) && \text{by equation (6.37)} \\
&= r_{k-1}^T A^{-1} r_{k-1} - 2\nu_k p_k^T r_{k-1} + \nu_k^2 p_k^T A p_k.
\end{aligned}$$

This expression is a quadratic function of $\nu_k$, so it can be easily minimized by
setting its derivative with respect to $\nu_k$ to zero and solving for $\nu_k$. This yields

$$\begin{aligned}
\nu_k &= \frac{p_k^T r_{k-1}}{p_k^T A p_k} \\
&= \frac{(r_{k-1} + \mu_k p_{k-1})^T r_{k-1}}{p_k^T A p_k} && \text{by equation (6.39)} \\
&= \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k},
\end{aligned}$$

where we have used the fact that $p_{k-1}^T r_{k-1} = 0$, which holds since $r_{k-1}$ is
orthogonal to all vectors in $\mathcal{K}_{k-1}$, including $p_{k-1}$.)

To get a formula for $\mu_k$, multiply both sides of equation (6.39) on the left
by $p_{k-1}^T A$ and use the fact that $p_k$ and $p_{k-1}$ are A-conjugate (Lemma 6.8) to
get

$$\mu_k = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}}. \qquad (6.42)$$

The trouble with this formula for $\mu_k$ is that it requires another dot product,
$p_{k-1}^T A r_{k-1}$, besides the two required for $\nu_k$. So we will derive another formula
requiring no new dot products.
We do this by deriving an alternate formula for $\nu_k$: Multiply both sides of
equation (6.37) on the left by $r_k^T$, again use the fact that $r_{k-1}^T r_k = 0$, and solve
for $\nu_k$ to get

$$\nu_k = -\frac{r_k^T r_k}{r_k^T A p_k}. \qquad (6.43)$$

Equating the two expressions (6.41) and (6.43) for $\nu_{k-1}$ (note that we
have subtracted 1 from the subscript), rearranging, and comparing to equation
(6.42) yield our ultimate formula for $\mu_k$:

$$\mu_k = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}} = \frac{r_{k-1}^T r_{k-1}}{r_{k-2}^T r_{k-2}}. \qquad (6.44)$$

Combining recurrences (6.37), (6.38), and (6.39) and formulas (6.41) and
(6.44) yields our final implementation of the conjugate gradient algorithm.


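ALGORITHM 6.11. Conjugate gradient algorithm:

    $k = 0$; $x_0 = 0$; $r_0 = b$; $p_1 = b$;
    repeat
        $k = k + 1$
        $z = A \cdot p_k$
        $\nu_k = (r_{k-1}^T r_{k-1})/(p_k^T z)$
        $x_k = x_{k-1} + \nu_k p_k$
        $r_k = r_{k-1} - \nu_k z$
        $\mu_{k+1} = (r_k^T r_k)/(r_{k-1}^T r_{k-1})$
        $p_{k+1} = r_k + \mu_{k+1} p_k$
    until $\|r_k\|_2$ is small enough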
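
Algorithm 6.11 translates into code almost symbol for symbol. The following is a minimal sketch under assumptions, not a production solver: the function name, the absolute stopping tolerance `tol` on $\|r_k\|_2$, the iteration cap `max_iter`, and the test matrix tridiag(-1, 2, -1) are all choices made for this illustration.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, tol=1e-12, max_iter=1000):
    """Algorithm 6.11 for symmetric positive definite A, starting from x_0 = 0."""
    x = [0.0] * len(b)              # x_0
    r = list(b)                     # r_0 = b
    p = list(b)                     # p_1 = b
    rr = dot(r, r)                  # r_{k-1}^T r_{k-1}, saved between iterations
    k = 0
    while math.sqrt(rr) > tol and k < max_iter:
        k += 1
        z = matvec(A, p)                                # z = A p_k
        nu = rr / dot(p, z)                             # nu_k
        x = [xi + nu * pi for xi, pi in zip(x, p)]      # x_k
        r = [ri - nu * zi for ri, zi in zip(r, z)]      # r_k
        rr_new = dot(r, r)
        mu = rr_new / rr                                # mu_{k+1}
        p = [ri + mu * pi for ri, pi in zip(r, p)]      # p_{k+1}
        rr = rr_new
    return x, k

n = 6
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x_true = [1.0] * n
b = matvec(A, x_true)
x, iterations = conjugate_gradient(A, b)
print(iterations, max(abs(xi - ti) for xi, ti in zip(x, x_true)))
```

Each pass through the loop performs one matrix-vector product, two dot products, and three saxpy-like updates, and keeps only the vectors x, r, p, and z.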


The cost of the inner loop for CG is one matrix-vector product $z = A \cdot p_k$,
two inner products (by saving the value of $r_k^T r_k$ from one loop iteration to
the next), three saxpys, and a few scalar operations. The only vectors that
need to be stored are the current values of $r$, $x$, $p$, and $z = Ap$. For more
implementation details, including how to decide if "$\|r_k\|_2$ is small enough," see
NETLIB/templates/templates.html.
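
As a rough illustration, the following Python sketch transcribes Algorithm 6.11 for a small dense matrix stored as a list of rows. The helper names (`matvec`, `dot`, `conjugate_gradient`), the relative stopping tolerance, and the iteration cap are assumptions of this sketch; the algorithm itself leaves "small enough" unspecified.

```python
import math


def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-12, max_iter=None):
    """Follow Algorithm 6.11 for symmetric positive definite A.

    Returns (x, number of iterations). The stopping test
    ||r_k||_2 <= tol * ||b||_2 is an assumption of this sketch.
    """
    n = len(b)
    max_iter = 10 * n if max_iter is None else max_iter
    x = [0.0] * n          # x_0 = 0
    r = list(b)            # r_0 = b
    p = list(b)            # p_1 = b
    rr = dot(r, r)         # r^T r, saved and reused in the next pass
    stop = tol * math.sqrt(rr)
    k = 0
    while k < max_iter and math.sqrt(rr) > stop:
        k += 1
        z = matvec(A, p)                             # one matrix-vector product
        nu = rr / dot(p, z)                          # inner product 1
        x = [xi + nu * pi for xi, pi in zip(x, p)]   # saxpy
        r = [ri - nu * zi for ri, zi in zip(r, z)]   # saxpy
        rr_new = dot(r, r)                           # inner product 2
        mu = rr_new / rr
        p = [ri + mu * pi for ri, pi in zip(r, p)]   # saxpy
        rr = rr_new
    return x, k


# Small symmetric positive definite test system (made up for illustration).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, steps = conjugate_gradient(A, b)
residual = [bi - yi for bi, yi in zip(b, matvec(A, x))]
print(steps, max(abs(v) for v in residual))
```

Only `x`, `r`, `p`, and `z` are kept between passes, and each pass performs exactly one product with `A`, matching the cost count above.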


6.6.4. Convergence Analysis of the Conjugate Gradient Method 


We begin with a convergence analysis of CG that depends only on the condition 
number of A. This analysis will show that the number of CG iterations needed 
to reduce the error by a fixed factor less than 1 is proportional to the square 
root of the condition number. This worst-case analysis is a good estimate for 
the speed of convergence on our model problem, Poisson's equation. But it
severely underestimates the speed of convergence in many other cases. After
presenting the bound based on the condition number, we describe when we 
can expect faster convergence. 

We start with the initial approximate solution $x_0 = 0$. Recall that $x_k$
minimizes the $A^{-1}$-norm of the residual $r_k = b - Ax_k$ over all possible solutions
$x_k \in \mathcal{K}_k(A, b)$. This means $x_k$ minimizes

$$
\|b - Az\|_{A^{-1}}^2 \equiv f(z) = (b - Az)^T A^{-1} (b - Az) = (z - x)^T A (z - x)
$$

over all $z \in \mathcal{K}_k = \mathrm{span}[b, Ab, A^2 b, \ldots, A^{k-1} b]$. Any $z \in \mathcal{K}_k(A, b)$ may be written
$z = \sum_{j=0}^{k-1} \alpha_j A^j b = p_{k-1}(A) b = p_{k-1}(A) A x$, where $p_{k-1}(\xi) = \sum_{j=0}^{k-1} \alpha_j \xi^j$ is a
polynomial of degree $k - 1$. Therefore,

$$
\begin{aligned}
f(z) &= [(I - p_{k-1}(A)A)x]^T A [(I - p_{k-1}(A)A)x] \\
&= (q_k(A)x)^T A (q_k(A)x) \\
&= x^T q_k(A) A q_k(A) x,
\end{aligned}
$$

where $q_k(\xi) = 1 - p_{k-1}(\xi) \cdot \xi$ is a degree-$k$ polynomial with $q_k(0) = 1$. Note
that $(q_k(A))^T = q_k(A)$ because $A = A^T$. Letting $Q_k$ be the set of all degree-$k$
polynomials which take the value 1 at 0, this means

$$
f(x_k) = \min_{z \in \mathcal{K}_k} f(z) = \min_{q_k \in Q_k} x^T q_k(A) A q_k(A) x. \tag{6.45}
$$


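To simplify this expression, write the eigendecomposition $A = Q \Lambda Q^T$ and let
$Q^T x = y$ so that

$$
\begin{aligned}
f(x_k) = \min_{z \in \mathcal{K}_k} f(z)
&= \min_{q_k \in Q_k} x^T (q_k(Q \Lambda Q^T)) (Q \Lambda Q^T) (q_k(Q \Lambda Q^T)) x \\
&= \min_{q_k \in Q_k} x^T (Q q_k(\Lambda) Q^T)(Q \Lambda Q^T)(Q q_k(\Lambda) Q^T) x \\
&= \min_{q_k \in Q_k} y^T q_k(\Lambda) \Lambda q_k(\Lambda) y \\
&= \min_{q_k \in Q_k} y^T \cdot \mathrm{diag}(q_k(\lambda_i) \lambda_i q_k(\lambda_i)) \cdot y \\
&= \min_{q_k \in Q_k} \sum_{i=1}^{n} y_i^2 \lambda_i (q_k(\lambda_i))^2 \\
&\le \min_{q_k \in Q_k} \left( \max_{\lambda_i \in \lambda(A)} (q_k(\lambda_i))^2 \right) \sum_{i=1}^{n} y_i^2 \lambda_i \\
&= \min_{q_k \in Q_k} \left( \max_{\lambda_i \in \lambda(A)} (q_k(\lambda_i))^2 \right) f(x_0)
\end{aligned}
$$

since $x_0 = 0$ implies $f(x_0) = x^T A x = y^T \Lambda y = \sum_{i=1}^{n} y_i^2 \lambda_i$. Therefore,

$$
\frac{\|r_k\|_{A^{-1}}^2}{\|r_0\|_{A^{-1}}^2} = \frac{f(x_k)}{f(x_0)}
\le \min_{q_k \in Q_k} \max_{\lambda_i \in \lambda(A)} (q_k(\lambda_i))^2
$$

or

$$
\frac{\|r_k\|_{A^{-1}}}{\|r_0\|_{A^{-1}}} \le \min_{q_k \in Q_k} \max_{\lambda_i \in \lambda(A)} |q_k(\lambda_i)|.
$$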

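We have thus reduced the question of how fast CG converges to a question
about polynomials: How small can a degree-$k$ polynomial $q_k(\xi)$ be when $\xi$
ranges over the eigenvalues of $A$, while simultaneously satisfying $q_k(0) = 1$?
Since $A$ is positive definite, its eigenvalues lie in the interval $[\lambda_{\min}, \lambda_{\max}]$, where
$0 < \lambda_{\min} \le \lambda_{\max}$, so to get a simple upper bound we will instead seek a
degree-$k$ polynomial $\hat{q}_k(\xi)$ that is small on the whole interval $[\lambda_{\min}, \lambda_{\max}]$ and 1 at
0. A polynomial $\hat{q}_k(\xi)$ that has this property is easily constructed from the
Chebyshev polynomials $T_k(\xi)$ discussed in section 6.5.6. Recall that $|T_k(\xi)| \le 1$
when $|\xi| \le 1$ and increases rapidly when $|\xi| > 1$ (see Figure 6.6). Now let

$$
\hat{q}_k(\xi) = T_k\!\left( \frac{\lambda_{\max} + \lambda_{\min} - 2\xi}{\lambda_{\max} - \lambda_{\min}} \right)
\Big/ \; T_k\!\left( \frac{\lambda_{\max} + \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}} \right).
$$

It is easy to see that $\hat{q}_k(0) = 1$, and if $\xi \in [\lambda_{\min}, \lambda_{\max}]$, then

$$
\left| \frac{\lambda_{\max} + \lambda_{\min} - 2\xi}{\lambda_{\max} - \lambda_{\min}} \right| \le 1,
$$

so

$$
\frac{\|r_k\|_{A^{-1}}}{\|r_0\|_{A^{-1}}}
\le \min_{q_k \in Q_k} \max_{\lambda_i \in \lambda(A)} |q_k(\lambda_i)|
\le \frac{1}{T_k\!\left( \frac{\lambda_{\max} + \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}} \right)}
= \frac{1}{T_k\!\left( \frac{\kappa + 1}{\kappa - 1} \right)}
= \frac{1}{T_k\!\left( 1 + \frac{2}{\kappa - 1} \right)}, \tag{6.46}
$$

where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number of $A$.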

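If the condition number $\kappa$ is near 1, $1 + 2/(\kappa - 1)$ is large, $1/T_k(1 + \frac{2}{\kappa - 1})$
is small, and convergence is rapid. If $\kappa$ is large, convergence slows down, with
the convergence rate

$$
\frac{1}{T_k\!\left( 1 + \frac{2}{\kappa - 1} \right)} \le \frac{2}{1 + \frac{2k}{\sqrt{\kappa - 1}}}.
$$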
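
To get a feel for the size of the bound in inequality (6.46), the short Python sketch below evaluates $1/T_k(1 + \frac{2}{\kappa - 1})$ using the identity $T_k(t) = \cosh(k \cdot \mathrm{arccosh}\, t)$ for $t \ge 1$, and compares it with the simpler estimate $2/(1 + 2k/\sqrt{\kappa - 1})$. The function names are made up for this illustration, and the value $\kappa = 4134$ is borrowed from Example 6.15.

```python
import math


def cheb_bound(kappa, k):
    """1 / T_k(1 + 2/(kappa - 1)), with T_k(t) = cosh(k * acosh(t)) for t >= 1."""
    t = 1.0 + 2.0 / (kappa - 1.0)
    return 1.0 / math.cosh(k * math.acosh(t))


def simple_bound(kappa, k):
    """The cruder estimate 2 / (1 + 2k / sqrt(kappa - 1))."""
    return 2.0 / (1.0 + 2.0 * k / math.sqrt(kappa - 1.0))


kappa = 4134.0
for k in (10, 50, 100, 200):
    print(k, cheb_bound(kappa, k), simple_bound(kappa, k))
```

For these values of $k$ the Chebyshev expression is the smaller of the two, and it decays geometrically once $k$ is comparable to $\sqrt{\kappa}$.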


EXAMPLE 6.14. For the $N$-by-$N$ model problem, $\kappa = O(N^2)$, so after $k$ steps
of CG the residual is multiplied by about $(1 - O(N^{-1}))^k$, the same as SOR
with optimal overrelaxation parameter $\omega$. In other words, CG takes $O(N) =
O(n^{1/2})$ iterations to converge. Since each iteration costs $O(n)$, the overall cost
is $O(n^{3/2})$. This explains the entry for CG in Table 6.1. $\diamond$


This analysis using the condition number does not explain all the impor- 
tant convergence behavior of CG. The next example shows that the entire 
distribution of eigenvalues of A is important, not just the ratio of the largest 
to the smallest one.


[Figure 6.7 plot: "Relative Residual vs. Number of CG Iterations"; horizontal axis from 0 to 200 iterations.]

Fig. 6.7. Graph of relative residuals computed by CG.


EXAMPLE 6.15. Let us consider Figure 6.7, which plots the relative residual
$\|r_k\|_2/\|r_0\|_2$ at each CG step for eight different linear systems. The relative
residual $\|r_k\|_2/\|r_0\|_2$ measures the speed of convergence; our implementation
of CG terminates when this ratio sinks below $10^{-13}$, or after $k = 200$ steps,
whichever comes first.

All eight linear systems shown have the same dimension $n = 10^4$ and
the same condition number $\kappa \approx 4134$, yet their convergence behaviors are
radically different. The uppermost (dash-dot) line is $1/T_k(1 + \frac{2}{\kappa - 1})$, which
inequality (6.46) tells us is an upper bound on $\|r_k\|_{A^{-1}}/\|r_0\|_{A^{-1}}$. It turns out
the graphs of $\|r_k\|_2/\|r_0\|_2$ and the graphs of $\|r_k\|_{A^{-1}}/\|r_0\|_{A^{-1}}$ are nearly the
same, so we plot only the former, which are easier to interpret.

The solid line is $\|r_k\|_2/\|r_0\|_2$ for Poisson's equation on a 100-by-100 grid
with a random right-hand side $b$. We see that the upper bound captures its
general convergence behavior. The seven dashed lines are plots of $\|r_k\|_2/\|r_0\|_2$
for seven diagonal linear systems $D_i x = b$, numbered from $D_1$ on the left to
$D_7$ on the right. Each $D_i$ has the same dimension and condition number as
Poisson's equation, so we need to study them more closely to understand their
differing convergence behaviors.

We have constructed each $D_i$ so that its smallest $m_i$ and largest $m_i$
eigenvalues are identical to those of Poisson's equation, with the remaining $n - 2m_i$
eigenvalues equal to the geometric mean of the largest and smallest eigenvalues.
In other words, $D_i$ has only $d_i = 2m_i + 1$ distinct eigenvalues. We let
$k_i$ denote the number of CG iterations it takes for the solution of $D_i x = b$ to
reach $\|r_k\|_2/\|r_0\|_2 < 10^{-13}$. The convergence properties are summarized in the
following table:

    Example number                   $i$     |  1    2    3    4    5     6     7
    Number of distinct eigenvalues   $d_i$   |  3   11   41   81   201   401   5000
    Number of steps to converge      $k_i$   |  3   11   27   59    94   134   > 200

We see that the number $k_i$ of steps required to converge grows with the number
$d_i$ of distinct eigenvalues. $D_7$ has the same spectrum as Poisson's equation,
and converges about as slowly.

In the absence of roundoff, we claim that CG would take exactly $k_i = d_i$
steps to converge. The reason is that we can find a polynomial $q_{d_i}(\xi)$ of degree
$d_i$ that is zero at the eigenvalues $\alpha_j$ of $A$, while $q_{d_i}(0) = 1$, namely,

$$
q_{d_i}(\xi) = \frac{\prod_{j=1}^{d_i} (\alpha_j - \xi)}{\prod_{j=1}^{d_i} \alpha_j}.
$$

Equation (6.45) tells us that after $d_i$ steps, CG minimizes $\|r_{d_i}\|_{A^{-1}}^2 = f(x_{d_i})$
over all possible degree-$d_i$ polynomials equaling 1 at 0. Since $q_{d_i}$ is one of those
polynomials and $q_{d_i}(A) = 0$, we must have $\|r_{d_i}\|_{A^{-1}}^2 = 0$, or $r_{d_i} = 0$. $\diamond$
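
The claim that CG needs only as many steps as there are distinct eigenvalues can be tried on a toy problem. The Python sketch below runs CG on a 12-by-12 diagonal matrix with just three distinct eigenvalues; the matrix, right-hand side, and function name are all made up for this illustration and are much smaller than the systems $D_i$ above.

```python
import math


def cg_residual_history(diag, b, steps):
    """Run CG on the diagonal system D x = b; return ||r_k||_2 for k = 0..steps."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(b)
    rr = sum(v * v for v in r)
    history = [math.sqrt(rr)]
    for _ in range(steps):
        if rr == 0.0:
            break                                   # converged exactly; avoid 0/0
        z = [d * v for d, v in zip(diag, p)]        # z = D p
        nu = rr / sum(pv * zv for pv, zv in zip(p, z))
        x = [xv + nu * pv for xv, pv in zip(x, p)]
        r = [rv - nu * zv for rv, zv in zip(r, z)]
        rr_new = sum(v * v for v in r)
        p = [rv + (rr_new / rr) * pv for rv, pv in zip(r, p)]
        rr = rr_new
        history.append(math.sqrt(rr))
    return history


# Only d = 3 distinct eigenvalues: 1, 2, and 4, each repeated four times.
diag = [1.0] * 4 + [2.0] * 4 + [4.0] * 4
b = [1.0] * 12
history = cg_residual_history(diag, b, steps=3)
for k, res in enumerate(history):
    print(k, res / history[0])
```

The relative residual is still sizable after two steps and drops to roundoff level after the third, as the polynomial argument predicts for a well-conditioned problem this small.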


One lesson of Example 6.15 is that if the largest and smallest eigenvalues 
of A are few in number (or clustered closely together), then CG will converge 
much more quickly than an analysis based just on A’s condition number would 
indicate. 

Another lesson is that the behavior of CG in floating point arithmetic can
differ significantly from its behavior in exact arithmetic. We saw this because
the number $d_i$ of distinct eigenvalues frequently differed from the number $k_i$
of steps required to converge, although in theory we showed that they should
be identical. Still, $d_i$ and $k_i$ were of the same order of magnitude.

Indeed, if one were to perform CG in exact arithmetic and compare the 
computed solutions and residuals with those computed in floating point arith- 
metic, they would very probably diverge and soon be quite different. Still, as 
long as A is not too ill-conditioned, the floating point result will eventually 
converge to the desired solution of Ax = b, and so CG is still very useful. 
The fact that the exact and floating point results can differ dramatically is 
interesting but does not prevent the practical use of CG. 

When CG was discovered, it was proven that in exact arithmetic it would
provide the exact answer after $n$ steps, since then $r_{n+1}$ would be orthogonal
to $n$ other orthogonal vectors $r_1$ through $r_n$, and so must be zero. In other
words, CG was thought of as a direct method rather than an iterative method.
When convergence after $n$ steps did not occur in practice, CG was considered
unstable and then abandoned for many years. Eventually it was recognized as
a perfectly good iterative method, often providing quite accurate answers after
$k \ll n$ steps.

Recently, a subtle backward error analysis was devised to explain the
observed behavior of CG in floating point and explain how it can differ from
exact arithmetic [121]. This behavior can also include long "plateaus" in the
convergence, with $\|r_k\|_2$ decreasing little for many iterations, interspersed with
periods of rapid convergence. This behavior can be explained by showing that
CG applied to $Ax = b$ in floating point arithmetic behaves exactly like CG
applied to $\hat{A}\hat{x} = \hat{b}$ in exact arithmetic, where $\hat{A}$ is close to $A$ in the following
sense: $\hat{A}$ has a much larger dimension than $A$, but $\hat{A}$'s eigenvalues all lie in
narrow clusters around the eigenvalues of $A$. Thus the plateaus in convergence
correspond to the polynomial $q_k$ underlying CG developing more and more
zeros near the eigenvalues of $\hat{A}$ lying in a cluster.


6.6.5. Preconditioning 


In the previous section we saw that the convergence rate of CG depended on
the condition number of $A$, or more generally the distribution of $A$'s eigenvalues.
Other Krylov subspace methods have the same property. Preconditioning
means replacing the system $Ax = b$ with the system $M^{-1}Ax = M^{-1}b$, where
$M$ is an approximation to $A$ with the properties that


1. $M$ is symmetric and positive definite,
2. $M^{-1}A$ is well conditioned or has few extreme eigenvalues,
3. $Mx = b$ is easy to solve.


A careful, problem-dependent choice of $M$ can often make the condition number
of $M^{-1}A$ much smaller than the condition number of $A$ and thus accelerate
convergence dramatically. Indeed, a good preconditioner is often necessary for
an iterative method to converge at all, and much current research in iterative
methods is directed at finding better preconditioners (see also section 6.10).

We cannot apply CG directly to the system $M^{-1}Ax = M^{-1}b$, because
$M^{-1}A$ is generally not symmetric. We derive the preconditioned conjugate
gradient method as follows. Let $M = Q \Lambda Q^T$ be the eigendecomposition of
$M$, and define $M^{1/2} \equiv Q \Lambda^{1/2} Q^T$. Note that $M^{1/2}$ is also symmetric positive
definite, and $(M^{1/2})^2 = M$. Now multiply $M^{-1}Ax = M^{-1}b$ through by $M^{1/2}$
to get the new symmetric positive definite system $(M^{-1/2} A M^{-1/2})(M^{1/2} x) =
M^{-1/2} b$, or $\hat{A}\hat{x} = \hat{b}$. Note that $\hat{A}$ and $M^{-1}A$ have the same eigenvalues since
they are similar ($M^{-1}A = M^{-1/2} \hat{A} M^{1/2}$). We now apply CG implicitly to the
system $\hat{A}\hat{x} = \hat{b}$ in such a way that avoids the need to multiply by $M^{-1/2}$. This
yields the following algorithm.


ALGORITHM 6.12. Preconditioned CG algorithm:

    $k = 0$; $x_0 = 0$; $r_0 = b$; $p_1 = M^{-1} b$; $y_0 = M^{-1} r_0$
    repeat
        $k = k + 1$
        $z = A \cdot p_k$
        $\nu_k = (y_{k-1}^T r_{k-1}) / (p_k^T z)$
        $x_k = x_{k-1} + \nu_k p_k$
        $r_k = r_{k-1} - \nu_k z$
        $y_k = M^{-1} r_k$
        $\mu_{k+1} = (y_k^T r_k) / (y_{k-1}^T r_{k-1})$
        $p_{k+1} = y_k + \mu_{k+1} p_k$
    until $\|r_k\|_2$ is small enough


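THEOREM 6.9. Let $A$ and $M$ be symmetric positive definite, $\hat{A} = M^{-1/2} A M^{-1/2}$,
and $\hat{b} = M^{-1/2} b$. The CG algorithm applied to $\hat{A}\hat{x} = \hat{b}$,

    $k = 0$; $\hat{x}_0 = 0$; $\hat{r}_0 = \hat{b}$; $\hat{p}_1 = \hat{b}$;
    repeat
        $k = k + 1$
        $\hat{z} = \hat{A} \cdot \hat{p}_k$
        $\hat{\nu}_k = (\hat{r}_{k-1}^T \hat{r}_{k-1}) / (\hat{p}_k^T \hat{z})$
        $\hat{x}_k = \hat{x}_{k-1} + \hat{\nu}_k \hat{p}_k$
        $\hat{r}_k = \hat{r}_{k-1} - \hat{\nu}_k \hat{z}$
        $\hat{\mu}_{k+1} = (\hat{r}_k^T \hat{r}_k) / (\hat{r}_{k-1}^T \hat{r}_{k-1})$
        $\hat{p}_{k+1} = \hat{r}_k + \hat{\mu}_{k+1} \hat{p}_k$
    until $\|\hat{r}_k\|_2$ is small enough

and Algorithm 6.12 are related as follows:

$$
\begin{aligned}
\hat{\mu}_k &= \mu_k, \\
\hat{\nu}_k &= \nu_k, \\
\hat{z} &= M^{-1/2} z, \\
\hat{x}_k &= M^{1/2} x_k, \\
\hat{r}_k &= M^{-1/2} r_k, \\
\hat{p}_k &= M^{1/2} p_k.
\end{aligned}
$$

Therefore, $x_k$ converges to $M^{-1/2}$ times the solution of $\hat{A}\hat{x} = \hat{b}$, i.e., to

$$
M^{-1/2} \hat{A}^{-1} \hat{b} = A^{-1} b.
$$

For a proof, see Question 6.14.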
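
The following Python sketch transcribes Algorithm 6.12, with the preconditioner supplied as a function that applies $M^{-1}$ to a vector. The names `preconditioned_cg`, `solve_M`, and `jacobi`, the stopping tolerance, and the small test matrix are assumptions made for this illustration; the diagonal choice of $M$ is only one simple option.

```python
import math


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def preconditioned_cg(A, b, solve_M, tol=1e-12, max_iter=100):
    """Follow Algorithm 6.12; solve_M(r) must return M^{-1} r for SPD M."""
    x = [0.0] * len(b)
    r = list(b)             # r_0 = b
    y = solve_M(r)          # y_0 = M^{-1} r_0
    p = list(y)             # p_1 = M^{-1} b
    yr = dot(y, r)
    stop = tol * math.sqrt(dot(b, b))
    k = 0
    while k < max_iter and math.sqrt(dot(r, r)) > stop:
        k += 1
        z = matvec(A, p)
        nu = yr / dot(p, z)
        x = [xi + nu * pi for xi, pi in zip(x, p)]
        r = [ri - nu * zi for ri, zi in zip(r, z)]
        y = solve_M(r)      # the only place M enters the loop
        yr_new = dot(y, r)
        mu = yr_new / yr
        p = [yi + mu * pi for yi, pi in zip(y, p)]
        yr = yr_new
    return x, k


# SPD matrix with widely varying diagonal entries (made up for illustration).
A = [[100.0, 1.0, 0.0],
     [1.0, 10.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [1.0, 2.0, 3.0]


def jacobi(r):
    """Apply M^{-1} for the diagonal preconditioner M = diag(a_11, ..., a_nn)."""
    return [ri / A[i][i] for i, ri in enumerate(r)]


x, steps = preconditioned_cg(A, b, jacobi)
residual = [bi - yi for bi, yi in zip(b, matvec(A, x))]
print(steps, max(abs(v) for v in residual))
```

Note that the loop never forms $M^{-1/2}$; it needs only one solve with $M$ per iteration.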

Now we describe some common preconditioners. Note that our twin goals of
minimizing the condition number of $M^{-1}A$ and keeping $Mx = b$ easy to solve
are in conflict with one another: Choosing $M = A$ minimizes the condition
number of $M^{-1}A$ but leaves $Mx = b$ as hard to solve as the original problem.
Choosing $M = I$ makes solving $Mx = b$ trivial but leaves the condition number
of $M^{-1}A$ unchanged. Since we need to solve $Mx = b$ in the inner loop of the
algorithm, we restrict our discussion to those $M$ for which solving $Mx = b$ is
easy, and describe when they are likely to decrease the condition number of
$M^{-1}A$.

• If $A$ has widely varying diagonal entries, we may use the simple diagonal
preconditioner $M = \mathrm{diag}(a_{11}, \ldots, a_{nn})$. One can show that among
all possible diagonal preconditioners, this choice reduces the condition
number of $M^{-1}A$ to within a factor of $n$ of its minimum value [242].
This is also called Jacobi preconditioning.


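• As a generalization of the first preconditioner, let

$$
A = \begin{bmatrix} A_{11} & \cdots & A_{1b} \\ \vdots & \ddots & \vdots \\ A_{b1} & \cdots & A_{bb} \end{bmatrix}
$$

be a block matrix, where the diagonal blocks $A_{ii}$ are square. Then among
all block diagonal preconditioners

$$
M = \begin{bmatrix} M_{11} & & \\ & \ddots & \\ & & M_{bb} \end{bmatrix},
$$

where $M_{ii}$ and $A_{ii}$ have the same dimensions, the choice $M_{ii} = A_{ii}$
minimizes the condition number of $M^{-1/2} A M^{-1/2}$ to within a factor
of $b$ [68]. This is also called block Jacobi preconditioning.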


• Like Jacobi, SSOR can also be used to create a (block) preconditioner.


• An incomplete Cholesky factorization $LL^T$ of $A$ is an approximation
$A \approx LL^T$, where $L$ is limited to a particular sparsity pattern, such as
the original pattern of $A$. In other words, no fill-in is allowed during
Cholesky. Then $M = LL^T$ is used. (For nonsymmetric problems, there
is a corresponding incomplete LU preconditioner.)


• Domain decomposition is used when $A$ represents an equation (such as
Poisson's equation) on a physical region $\Omega$. So far, for Poisson's equation,
we have let $\Omega$ be the unit square. More generally, the region $\Omega$ may be
broken up into disjoint (or slightly overlapping) subregions $\Omega = \cup_j \Omega_j$,
and the equation may be solved on each subregion independently. For
example, if we are solving Poisson's equation and if the subregions are
squares or rectangles, these subproblems can be solved very quickly using
FFTs. Solving these subproblems corresponds to a block diagonal $M$ (if
the subregions are disjoint) or a product of block diagonal $M$ (if the
subregions overlap). This is discussed in more detail in section 6.10.


A number of these preconditioners have been implemented in the software 
packages PETSc [230] and PARPRE (NETLIB/scalapack/parpre.tar.gz).


6.6.6. Other Krylov Subspace Algorithms for Solving Ax = b


So far we have concentrated on symmetric positive definite linear systems
and minimized the $A^{-1}$-norm of the residual. In this section we describe
methods for other kinds of linear systems and offer advice on which method to
use, based on simple properties of the matrix. See Figure 6.8 for a summary,
[15, 105, 134, 212] and NETLIB/templates for details, and NETLIB/templates
in particular for more comprehensive advice on choosing a method, along with
software.

Any system $Ax = b$ can be changed to a symmetric positive definite system
by solving the normal equations $A^T A x = A^T b$ (or $A A^T y = b$, $x = A^T y$).
This includes the least squares problem $\min_x \|Ax - b\|_2$. This lets us use CG,
provided that we can multiply vectors both by $A$ and $A^T$. Since the condition
number of $A^T A$ or $A A^T$ is the square of the condition number of $A$, this
method can lead to slow convergence if $A$ is ill-conditioned but is fast if $A$ is
well-conditioned (or $A^T A$ has a "good" distribution of eigenvalues, as discussed
in section 6.6.4).

We can minimize the two-norm of the residual instead of the $A^{-1}$-norm
when $A$ is symmetric positive definite. This is called the minimum residual
algorithm, or MINRES [192]. Since MINRES is more expensive than CG and is
often less accurate because of numerical instabilities, it is not used for positive
definite systems. But MINRES can be used when the matrix is symmetric
indefinite, whereas CG cannot. In this case, we can also use the SYMMLQ
algorithm of Paige and Saunders [192], which produces a residual $r_k \perp \mathcal{K}_k(A, b)$
at each step.

Unfortunately, there are few matrices other than symmetric matrices where
algorithms like CG exist that simultaneously


1. either minimize the residual $\|r_k\|_2$ or keep it orthogonal $r_k \perp \mathcal{K}_k$,


2. require a fixed number of dot products and saxpys in the inner loop,
independent of $k$.


Essentially, algorithms satisfying these two properties exist only for matrices
of the form $e^{i\theta}(T + \sigma I)$, where $T = T^T$ (or $TH = (HT)^T$ for some symmetric
positive definite $H$), $\theta$ is real, and $\sigma$ is complex [100, 249]. For these symmetric
and special nonsymmetric $A$, it turns out we can find a short recurrence, as
in the Lanczos algorithm, for computing an orthogonal basis $[q_1, \ldots, q_k]$ of
$\mathcal{K}_k(A, b)$. The fact that there are just a few terms in the recurrence for updating
$q_k$ means that it can be computed very efficiently.

This existence of short recurrences no longer holds for general nonsymmetric
$A$. In this case, we can use Arnoldi's algorithm. So instead of the
tridiagonal matrix $T_k = Q_k^T A Q_k$, we get a fully upper Hessenberg matrix
$H_k = Q_k^T A Q_k$. The GMRES algorithm (generalized minimum residual) uses
this decomposition to choose $x_k = Q_k y_k \in \mathcal{K}_k(A, b)$ to minimize the residual

$$
\begin{aligned}
\|r_k\|_2 &= \|b - A x_k\|_2 \\
&= \|b - A Q_k y_k\|_2 \\
&= \|b - (Q H Q^T) Q_k y_k\|_2 && \text{by equation (6.30)} \\
&= \|Q^T b - H Q^T Q_k y_k\|_2 && \text{since } Q \text{ is orthogonal} \\
&= \left\| e_1 \|b\|_2 - \begin{bmatrix} H_k & H_{uk} \\ H_{ku} & H_u \end{bmatrix} \begin{bmatrix} y_k \\ 0 \end{bmatrix} \right\|_2
&& \text{by equation (6.30) and since the first column of } Q = [Q_k, Q_u] \text{ is } b/\|b\|_2 \\
&= \left\| e_1 \|b\|_2 - \begin{bmatrix} H_k \\ H_{ku} \end{bmatrix} y_k \right\|_2 .
\end{aligned}
$$

Since only the first row of $H_{ku}$ is nonzero, this is a $(k+1)$-by-$k$ upper Hessenberg
least squares problem for the entries of $y_k$. Since it is upper Hessenberg, the QR
decomposition needed to solve it can be accomplished with $k$ Givens rotations,
at a cost of $O(k^2)$ instead of $O(k^3)$. Also, the storage required is $O(kn)$, since
$Q_k$ must be stored. One way to limit the growth in cost and storage is to
restart GMRES, i.e., taking the answer $x_k$ computed after $k$ steps, restarting
GMRES to solve the linear system $Ad = r_k = b - Ax_k$, and updating the
solution to get $x_k + d$; this is called GMRES($k$). Still, even GMRES($k$) is more
expensive than CG, where the cost of the inner loop does not depend on $k$ at
all.
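
To make the Hessenberg structure concrete, here is a small Python sketch of the Arnoldi process that GMRES builds on. It uses modified Gram-Schmidt orthogonalization and assumes no breakdown occurs; the function name `arnoldi`, the list-of-rows storage, and the 4-by-4 test matrix are made up for this illustration. It returns $k+1$ orthonormal vectors and the $(k+1)$-by-$k$ upper Hessenberg matrix that appears in the least squares problem above.

```python
import math


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def arnoldi(A, b, k):
    """Return (Q, H): Q holds k+1 orthonormal vectors (as rows), H is (k+1)-by-k.

    Column j of H satisfies A q_j = sum_i H[i][j] q_i, so H is upper Hessenberg.
    Assumes H[j+1][j] != 0 at every step (no breakdown).
    """
    beta = math.sqrt(dot(b, b))
    Q = [[v / beta for v in b]]                  # first basis vector is b / ||b||_2
    H = [[0.0] * k for _ in range(k + 1)]
    for j in range(k):
        w = matvec(A, Q[j])
        for i in range(j + 1):                   # orthogonalize against earlier q's
            H[i][j] = dot(Q[i], w)
            w = [wv - H[i][j] * qv for wv, qv in zip(w, Q[i])]
        H[j + 1][j] = math.sqrt(dot(w, w))
        Q.append([wv / H[j + 1][j] for wv in w])
    return Q, H


# Nonsymmetric test matrix (made up for illustration).
A = [[2.0, 1.0, 0.0, 0.0],
     [0.0, 3.0, 1.0, 0.0],
     [1.0, 0.0, 4.0, 1.0],
     [0.0, 2.0, 0.0, 5.0]]
b = [1.0, 0.0, 0.0, 0.0]
Q, H = arnoldi(A, b, k=3)
for row in H:
    print(row)
```

Because each new vector must be orthogonalized against all earlier ones, the work and storage per step grow with $k$, which is the cost that restarting is meant to limit.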

Another approach to nonsymmetric linear systems is to abandon computing
an orthonormal basis of $\mathcal{K}_k(A, b)$ and compute a nonorthonormal basis that
again reduces $A$ to (nonsymmetric) tridiagonal form. This is called the
nonsymmetric Lanczos method and requires matrix-vector multiplication by both
$A$ and $A^T$. This is important because $A^T z$ is sometimes harder (or impossible)
to compute (see Example 6.13). The advantage of tridiagonal form is that it is
much easier to solve with a tridiagonal matrix than a Hessenberg one. The
disadvantage is that the basis vectors may be very ill conditioned and may in fact
fail to exist at all, a phenomenon called breakdown. The potential efficiency
has led to a great deal of research on avoiding or alleviating this instability
(look-ahead Lanczos) and to competing methods, including biconjugate
gradients and quasi-minimum residuals. There are also some versions that do
not require multiplication by $A^T$, including conjugate gradients squared and
bi-conjugate gradient stabilized. No one method is best in all cases.

Figure 6.8 shows a decision tree giving simple advice on which method to 
try first, assuming that we have no other deep knowledge of the matrix A (such 
as that it arises from the Poisson equation). 


    A symmetric?
      No:  $A^T$ available?
        No:  Is storage expensive?
          No:  Try GMRES
          Yes: Try CGS or Bi-CGStab or GMRES(k)
        Yes: Is A well-conditioned?
          No:  Try QMR
          Yes: Try CG on normal equations
      Yes: A definite?
        No:  Is A well-conditioned?
          Yes: Try CG on normal equations
          No:  Try MINRES
        Yes: Largest and smallest eigenvalues known?
          No:  Try CG
          Yes: Try CG with Chebyshev Accel.

Fig. 6.8. Decision tree for choosing an iterative algorithm for Ax = b. Bi-CGStab =
bi-conjugate gradient stabilized. QMR = quasi-minimum residuals.


6.7. Fast Fourier Transform


In this section $i$ will always denote $\sqrt{-1}$.

We begin by showing how to solve the two-dimensional Poisson's equation
in a way requiring multiplication by the matrix of eigenvectors of $T_N$.
A straightforward implementation of this matrix-matrix multiplication would
cost $O(N^3) = O(n^{3/2})$ operations, which is expensive. Then we show how
this multiplication can be implemented using the FFT in only $O(N^2 \log N) =
O(n \log n)$ operations, which is within a factor of $\log n$ of optimal.

This solution is a discrete analogue of the Fourier series solution of the 
original differential equation (6.1) or (6.6). Later we will make this analogy 
more precise. 

Let $T_N = Z \Lambda Z^T$ be the eigendecomposition of $T_N$, as defined in Lemma 6.1.
We begin with the formulation of the two-dimensional Poisson's equation in
equation (6.11):

$$
T_N V + V T_N = h^2 F.
$$

Substitute $T_N = Z \Lambda Z^T$ and multiply by $Z^T$ on the left and $Z$ on the right
to get

$$
Z^T (Z \Lambda Z^T) V Z + Z^T V (Z \Lambda Z^T) Z = Z^T (h^2 F) Z
$$

or

$$
\Lambda V' + V' \Lambda = h^2 F',
$$

where $V' = Z^T V Z$ and $F' = Z^T F Z$. The $(j,k)$th entry of this last equation is

$$
(\Lambda V' + V' \Lambda)_{jk} = \lambda_j v'_{jk} + v'_{jk} \lambda_k = h^2 f'_{jk},
$$

which can be solved for $v'_{jk}$ to get

$$
v'_{jk} = \frac{h^2 f'_{jk}}{\lambda_j + \lambda_k}.
$$

This yields the first version of our algorithm.


ALGORITHM 6.13. Solving the two-dimensional Poisson's equation using the
eigendecomposition $T_N = Z \Lambda Z^T$:

    1) $F' = Z^T F Z$
    2) For all $j$ and $k$, $v'_{jk} = \dfrac{h^2 f'_{jk}}{\lambda_j + \lambda_k}$
    3) $V = Z V' Z^T$


The cost of step 2 is $3N^2 = 3n$ operations, and the cost of steps 1 and 3
is 4 matrix-matrix multiplications by $Z$ and $Z^T = Z$, which is $8N^3 = 8n^{3/2}$
operations using a conventional algorithm. In the next section we show how
multiplication by $Z$ is essentially the same as computing a discrete Fourier
transform, which can be done in $O(N^2 \log N) = O(n \log n)$ operations using
the FFT.
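
The three steps of Algorithm 6.13 can be sketched in a few lines of Python using conventional (cubic-cost) matrix products. The sketch assumes the standard closed forms for the eigenpairs of $T_N$, namely $\lambda_j = 2(1 - \cos\frac{\pi j}{N+1})$ and $z_{jk} = \sqrt{\frac{2}{N+1}} \sin\frac{\pi jk}{N+1}$ with indices running from 1 to $N$; the function names and the right-hand side `F` are made up for this illustration.

```python
import math


def matmul(X, Y):
    """Dense product of two square list-of-rows matrices (about 2N^3 flops)."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def poisson_eig_solve(F, h):
    """Solve T_N V + V T_N = h^2 F by the three steps of Algorithm 6.13."""
    N = len(F)
    s = math.sqrt(2.0 / (N + 1))
    # Z is symmetric and orthogonal, so Z^T = Z.
    Z = [[s * math.sin(math.pi * (j + 1) * (k + 1) / (N + 1)) for k in range(N)]
         for j in range(N)]
    lam = [2.0 * (1.0 - math.cos(math.pi * (j + 1) / (N + 1))) for j in range(N)]
    Fp = matmul(matmul(Z, F), Z)                        # step 1: F' = Z^T F Z
    Vp = [[h * h * Fp[j][k] / (lam[j] + lam[k]) for k in range(N)]
          for j in range(N)]                            # step 2: divide entrywise
    return matmul(matmul(Z, Vp), Z)                     # step 3: V = Z V' Z^T


N = 4
h = 1.0 / (N + 1)
F = [[float(i + 2 * j + 1) for j in range(N)] for i in range(N)]
V = poisson_eig_solve(F, h)

# Check the residual of T_N V + V T_N = h^2 F.
T = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(N)]
     for i in range(N)]
TV, VT = matmul(T, V), matmul(V, T)
err = max(abs(TV[i][j] + VT[i][j] - h * h * F[i][j])
          for i in range(N) for j in range(N))
print(err)
```

The four calls to `matmul` are exactly the four multiplications by $Z$ counted above; they are the part an FFT-based method would replace.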

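(Using the language of Kronecker products introduced in section 6.3.3, and
in particular the eigendecomposition of $T_{N \times N}$ from Proposition 6.1,

$$
T_{N \times N} = I \otimes T_N + T_N \otimes I = (Z \otimes Z) \cdot (I \otimes \Lambda + \Lambda \otimes I) \cdot (Z \otimes Z)^T,
$$

we can rewrite the formula justifying Algorithm 6.13 as follows:

$$
\begin{aligned}
\mathrm{vec}(V) &= (T_{N \times N})^{-1} \cdot \mathrm{vec}(h^2 F) \\
&= \left( (Z \otimes Z) \cdot (I \otimes \Lambda + \Lambda \otimes I) \cdot (Z \otimes Z)^T \right)^{-1} \cdot \mathrm{vec}(h^2 F) \\
&= (Z \otimes Z)^{-T} \cdot (I \otimes \Lambda + \Lambda \otimes I)^{-1} \cdot (Z \otimes Z)^{-1} \cdot \mathrm{vec}(h^2 F) \\
&= (Z \otimes Z) \cdot (I \otimes \Lambda + \Lambda \otimes I)^{-1} \cdot (Z^T \otimes Z^T) \cdot \mathrm{vec}(h^2 F). 
\end{aligned} \tag{6.47}
$$

We claim that doing the indicated matrix-vector multiplications from right to
left is mathematically the same as Algorithm 6.13; see Question 6.9. This also
shows how to extend the algorithm to Poisson's equation in higher dimensions.)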


6.7.1. The Discrete Fourier Transform 


In this subsection, we will number the rows and columns of matrices from 0 to 
N — 1 instead of from 1 to N. 


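DEFINITION 6.17. The discrete Fourier transform (DFT) of an $N$-vector $x$
is the vector $y = \Phi x$, where $\Phi$ is an $N$-by-$N$ matrix defined as follows. Let
$\omega = e^{-2\pi i/N} = \cos\frac{2\pi}{N} - i \cdot \sin\frac{2\pi}{N}$, a principal $N$th root of unity. Then $\phi_{jk} = \omega^{jk}$.
The inverse discrete Fourier transform (IDFT) of $y$ is the vector $x = \Phi^{-1} y$. $\diamond$


LEMMA 6.9. $\frac{1}{\sqrt{N}} \Phi$ is a symmetric unitary matrix, so $\Phi^{-1} = \frac{1}{N} \Phi^* = \frac{1}{N} \bar{\Phi}$.

Proof. Clearly $\Phi = \Phi^T$, so $\bar{\Phi} = \Phi^*$, and we need only show $\Phi \cdot \bar{\Phi} = N \cdot I$.
Compute $(\Phi \bar{\Phi})_{lj} = \sum_{k=0}^{N-1} \phi_{lk} \bar{\phi}_{kj} = \sum_{k=0}^{N-1} \omega^{lk} \bar{\omega}^{kj} = \sum_{k=0}^{N-1} \omega^{k(l-j)}$, since $\bar{\omega} =
\omega^{-1}$. If $l = j$, this sum is clearly $N$. If $l \neq j$, it is a geometric sum with value
$\frac{1 - \omega^{N(l-j)}}{1 - \omega^{l-j}} = 0$, since $\omega^N = 1$.
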
Thus, both the DFT and IDFT are just matrix-vector multiplications and
can be straightforwardly implemented in $2N^2$ flops. This operation is called
a DFT because of its close mathematical relationship to two other kinds of
Fourier analyses:

    the Fourier transform    $F(\zeta) = \int_{-\infty}^{\infty} e^{-2\pi i \zeta x} f(x)\,dx$
    and its inverse          $f(x) = \int_{-\infty}^{\infty} e^{+2\pi i \zeta x} F(\zeta)\,d\zeta$

    the Fourier series       $c_j = \int_0^1 e^{-2\pi i j x} f(x)\,dx$, where $f$ is periodic on $[0,1]$
    and its inverse          $f(x) = \sum_{j=-\infty}^{\infty} e^{+2\pi i j x} c_j$

    the DFT                  $y_j = (\Phi x)_j = \sum_{k=0}^{N-1} e^{-2\pi i jk/N} x_k$
    and its inverse          $x_k = (\Phi^{-1} y)_k = \frac{1}{N} \sum_{j=0}^{N-1} e^{+2\pi i jk/N} y_j$
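
As a small check on Definition 6.17 and Lemma 6.9, the Python sketch below forms $\Phi$ explicitly, applies it as a plain matrix-vector product, and inverts the transform with $\frac{1}{N} \bar{\Phi}$. The function names and the test vector are made up for this illustration, and the quadratic-cost approach is for exposition only.

```python
import cmath


def dft_matrix(N):
    """N-by-N matrix Phi with (j, k) entry omega^(j*k), omega = exp(-2*pi*i/N)."""
    omega = cmath.exp(-2j * cmath.pi / N)
    return [[omega ** (row * col) for col in range(N)] for row in range(N)]


def dft(x):
    """y = Phi x as a plain matrix-vector product (O(N^2) operations)."""
    Phi = dft_matrix(len(x))
    return [sum(p * v for p, v in zip(row, x)) for row in Phi]


def idft(y):
    """x = Phi^{-1} y = (1/N) conj(Phi) y, using Lemma 6.9."""
    N = len(y)
    Phi = dft_matrix(N)
    return [sum(p.conjugate() * v for p, v in zip(row, y)) / N for row in Phi]


x = [1.0, 2.0, 0.0, -1.0]
y = dft(x)
x_back = idft(y)
print(y)
print(x_back)
```

Rows and columns are numbered from 0, as in this subsection, so `y[0]` is simply the sum of the entries of `x`.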


We will make this close relationship more concrete in two ways. First, we
will show how to solve the model problem using the DFT and then the original
Poisson's equation (6.1) using Fourier series. This example will motivate us to
find a fast way to multiply by $\Phi$, because this will give us a fast way to solve
the model problem. This fast way is called the fast Fourier transform or FFT.
Instead of $2N^2$ flops, it will require only about $\frac{3}{2} N \log_2 N$ flops, which is much
less. We will derive the FFT by stressing a second mathematical relationship
shared among the different kinds of Fourier analyses: reducing convolution to
multiplication.

In Algorithm 6.13 we showed that solving the discrete Poisson equation
$T_N V + V T_N = h^2 F$ for $V$ requires the ability to multiply by the $N$-by-$N$
matrix $Z$, where

$$
z_{jk} = \sqrt{\frac{2}{N+1}} \, \sin \frac{\pi (j+1)(k+1)}{N+1}.
$$

(Recall that we number rows and columns from 0 to $N - 1$ in this section.)
Now consider the $(2N + 2)$-by-$(2N + 2)$ DFT matrix $\Phi$, whose $j,k$ entry is

$$
\exp\left( \frac{-2\pi i jk}{2N + 2} \right) = \exp\left( \frac{-\pi i jk}{N + 1} \right)
= \cos \frac{\pi jk}{N + 1} - i \cdot \sin \frac{\pi jk}{N + 1}.
$$

Thus the $N$-by-$N$ matrix $Z$ consists of $-\sqrt{\frac{2}{N+1}}$ times the imaginary part of
the second through $(N + 1)$st rows and columns of $\Phi$. So if we can multiply
efficiently by $\Phi$ using the FFT, then we can multiply efficiently by $Z$. (To
be most efficient, one modifies the FFT algorithm, which we describe below,
to multiply by $Z$ directly; this is called the fast sine transform. But one
can also just use the FFT.) Thus, multiplying $ZF$ quickly requires an FFT-like
operation on each column of $F$, and multiplying $FZ$ requires the same
operation on each row. (In three dimensions, we would let $V$ be an
$N$-by-$N$-by-$N$ array of unknowns and apply the same operation to each of the $3N^2$
sections parallel to the coordinate axes.)
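
The relationship between $Z$ and the larger DFT matrix can be checked numerically. In the Python sketch below, a real $N$-vector is placed in positions 1 through $N$ of a zero vector of length $2N+2$, the DFT of that vector is evaluated naively, and $-\sqrt{\frac{2}{N+1}}$ times the imaginary part of entries 1 through $N$ is compared with a direct multiplication by $Z$. The function names and test data are made up for this illustration; in practice an FFT would replace the naive sum.

```python
import cmath
import math


def sine_transform_via_dft(x):
    """Return Z x for a real vector x using a length-(2N+2) DFT (naive sum)."""
    N = len(x)
    M = 2 * N + 2
    padded = [0.0] + list(x) + [0.0] * (N + 1)      # x occupies positions 1..N
    scale = -math.sqrt(2.0 / (N + 1))
    out = []
    for row in range(1, N + 1):                      # rows 1..N of the big DFT
        s = sum(cmath.exp(-2j * cmath.pi * row * col / M) * padded[col]
                for col in range(M))
        out.append(scale * s.imag)
    return out


def sine_transform_direct(x):
    """Reference: multiply by Z entry by entry (rows and columns numbered from 0)."""
    N = len(x)
    c = math.sqrt(2.0 / (N + 1))
    return [sum(c * math.sin(math.pi * (j + 1) * (k + 1) / (N + 1)) * x[k]
                for k in range(N)) for j in range(N)]


x = [1.0, -2.0, 0.5, 3.0]
via_dft = sine_transform_via_dft(x)
direct = sine_transform_direct(x)
print(via_dft)
print(direct)
```

Taking the imaginary part relies on `x` being real; for complex data the real and imaginary parts would be transformed separately.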


6.7.2. Solving the Continuous Model Problem Using Fourier Series 


We now return to numbering rows and columns of matrices from 1 to $N$.

In this section we show how the algorithm for solving the discrete model
problem is a natural analogue of using Fourier series to solve the original
differential equation (6.1). We will do this for the one-dimensional model
problem.

Recall that Poisson's equation on $[0,1]$ is $-\frac{d^2 v}{dx^2} = f(x)$ with boundary
conditions $v(0) = v(1) = 0$. To solve this, we will expand $v(x)$ in a Fourier series:
$v(x) = \sum_{j=1}^{\infty} \alpha_j \sin(j \pi x)$. (The boundary condition $v(1) = 0$ tells us that no
cosine terms appear.) Plugging $v(x)$ into Poisson's equation yields

$$
\sum_{j=1}^{\infty} \alpha_j (j^2 \pi^2) \sin(j \pi x) = f(x).
$$

Multiply both sides by $\sin(k \pi x)$, integrate from 0 to 1, and use the fact that
$\int_0^1 \sin(j \pi x) \sin(k \pi x)\,dx = 0$ if $j \neq k$ and $1/2$ if $j = k$ to get

$$
\alpha_k = \frac{2}{k^2 \pi^2} \int_0^1 \sin(k \pi x) f(x)\,dx
$$

and finally

$$
v(x) = \sum_{j=1}^{\infty} \left( \frac{2}{j^2 \pi^2} \int_0^1 \sin(j \pi y) f(y)\,dy \right) \sin(j \pi x). \tag{6.48}
$$

j=l 
Now consider the discrete model problem $T_N v = h^2 f$. Since $T_N = Z \Lambda Z^T$,
we can write $v = T_N^{-1} h^2 f = Z \Lambda^{-1} Z^T h^2 f$, so

$$
v_k = \sum_{j=1}^{N} z_{kj} \frac{h^2}{\lambda_j} (Z^T f)_j
= \sum_{j=1}^{N} \sqrt{\frac{2}{N+1}} \sin\left( \frac{\pi jk}{N+1} \right) \frac{h^2}{\lambda_j} (Z^T f)_j, \tag{6.49}
$$

where

$$
\sqrt{\frac{2}{N+1}} \, (Z^T f)_j = \frac{2}{N+1} \sum_{l=1}^{N} \sin\left( \frac{\pi jl}{N+1} \right) f_l
\approx 2 \int_0^1 \sin(\pi j y) f(y)\,dy,
$$

since the last sum is just a Riemann sum approximation of the integral.
Furthermore, for small $j$, recall that $\frac{h^2}{\lambda_j} \approx \frac{1}{j^2 \pi^2}$. So we see how the solution of the
discrete problem (6.49) approximates the solution of the continuous problem
(6.48), with multiplication by $Z^T$ corresponding to multiplication by $\sin(j \pi x)$
and integration, and multiplication by $Z$ corresponding to summing the
different Fourier components.


6.7.3. Convolutions


The convolution is an important operation in Fourier analysis, whose definition
depends on whether we are doing Fourier transforms, Fourier series, or the
DFT:

    Fourier transform    $(f * g)(x) \equiv \int_{-\infty}^{\infty} f(x - y) g(y)\,dy$

    Fourier series       $(f * g)(x) \equiv \int_0^1 f(x - y) g(y)\,dy$

    DFT                  If $a = [a_0, \ldots, a_{N-1}, 0, \ldots, 0]^T$ and
                         $b = [b_0, \ldots, b_{N-1}, 0, \ldots, 0]^T$ are $2N$-vectors,
                         then $a * b \equiv c = [c_0, \ldots, c_{2N-1}]^T$, where
                         $c_k = \sum_{j=0}^{k} a_j b_{k-j}$


To illustrate the use of the discrete convolution, consider polynomial
multiplication. Let $a(x) = \sum_{k=0}^{N-1} a_k x^k$ and $b(x) = \sum_{k=0}^{N-1} b_k x^k$ be degree-$(N-1)$
polynomials. Then their product $c(x) \equiv a(x) \cdot b(x) = \sum_{k=0}^{2N-1} c_k x^k$, where the
coefficients $c_0, \ldots, c_{2N-1}$ are given by the discrete convolution.

One purpose of the Fourier transform, Fourier series, or DFT is to convert
convolution into multiplication. In the case of the Fourier transform, $\mathcal{F}(f * g) =
\mathcal{F}(f) \cdot \mathcal{F}(g)$; i.e., the Fourier transform of the convolution is the product of the
Fourier transforms. In the case of Fourier series, $c_j(f * g) = c_j(f) \cdot c_j(g)$;
i.e., the Fourier coefficients of the convolution are the product of the Fourier
coefficients. The same is true of the discrete convolution.


THEOREM 6.10. Let $a = [a_0, \ldots, a_{N-1}, 0, \ldots, 0]^T$ and $b = [b_0, \ldots, b_{N-1}, 0, \ldots, 0]^T$
be vectors of dimension $2N$, and let $c = a * b = [c_0, \ldots, c_{2N-1}]^T$. Then
$(\Phi c)_k = (\Phi a)_k \cdot (\Phi b)_k$.

Proof. If $a' = \Phi a$, then $a'_k = \sum_{j=0}^{2N-1} a_j \omega^{kj}$, the value of the polynomial
$a(x) \equiv \sum_{j=0}^{N-1} a_j x^j$ at $x = \omega^k$. Similarly $b' = \Phi b$ means $b'_k = \sum_{j=0}^{N-1} b_j \omega^{kj} =
b(\omega^k)$ and $c' = \Phi c$ means $c'_k = \sum_{j=0}^{2N-1} c_j \omega^{kj} = c(\omega^k)$. Therefore

$$
a'_k \cdot b'_k = a(\omega^k) \cdot b(\omega^k) = c(\omega^k) = c'_k
$$

as desired.

In other words, the DFT is polynomial evaluation at the points $\omega^0, \ldots, \omega^{N-1}$,
and conversely the IDFT is polynomial interpolation, producing the coefficients
of a polynomial given its values at $\omega^0, \ldots, \omega^{N-1}$.
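
Theorem 6.10 suggests a recipe for polynomial multiplication: pad both coefficient vectors to length $2N$, transform, multiply componentwise, and transform back. The Python sketch below follows that recipe with a naive quadratic-cost transform standing in for the FFT; the function names are made up for this illustration, and real coefficients are assumed so that the real part can be taken at the end.

```python
import cmath


def transform(v, sign=-1):
    """Naive transform with kernel exp(sign*2*pi*i*row*col/n).

    sign=-1 is the DFT; sign=+1 gives n times the inverse DFT.
    """
    n = len(v)
    return [sum(v[col] * cmath.exp(sign * 2j * cmath.pi * row * col / n)
                for col in range(n)) for row in range(n)]


def poly_multiply(a, b):
    """Multiply two real polynomials with N coefficients each (lowest degree first)."""
    N = len(a)
    if len(b) != N:
        raise ValueError("both polynomials must have the same number of coefficients")
    a_pad = list(a) + [0.0] * N
    b_pad = list(b) + [0.0] * N
    fa, fb = transform(a_pad), transform(b_pad)
    fc = [u * w for u, w in zip(fa, fb)]        # (Phi c)_k = (Phi a)_k * (Phi b)_k
    c = transform(fc, sign=+1)
    return [value.real / (2 * N) for value in c]


def poly_multiply_direct(a, b):
    """Reference: the discrete convolution c_k = sum_j a_j b_{k-j}."""
    N = len(a)
    c = [0.0] * (2 * N)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
print(poly_multiply([1.0, 2.0], [3.0, 4.0]))
print(poly_multiply_direct([1.0, 2.0], [3.0, 4.0]))
```

The transformed result agrees with the direct convolution up to roundoff; the padding to length $2N$ is what keeps the product's coefficients from wrapping around.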


6.7.4. Computing the Fast Fourier Transform 


We will derive the FFT via its interpretation as polynomial evaluation just
discussed. The goal is to evaluate $a(x) = \sum_{k=0}^{N-1} a_k x^k$ at $x = \omega^j$ for $0 \le j \le
N - 1$. For simplicity we will assume $N = 2^m$. Now write

$$
\begin{aligned}
a(x) &= a_0 + a_1 x + a_2 x^2 + \cdots + a_{N-1} x^{N-1} \\
&= (a_0 + a_2 x^2 + a_4 x^4 + \cdots) + x (a_1 + a_3 x^2 + a_5 x^4 + \cdots) \\
&\equiv a_{\rm even}(x^2) + x \cdot a_{\rm odd}(x^2).
\end{aligned}
$$

Thus, we need to evaluate two polynomials $a_{\rm even}$ and $a_{\rm odd}$ of degree $\frac{N}{2} - 1$
at $(\omega^j)^2$, $0 \le j \le N - 1$. But this is really just $\frac{N}{2}$ points $\omega^{2j}$ for $0 \le j \le \frac{N}{2} - 1$,
since $\omega^{2j} = \omega^{2(j + \frac{N}{2})}$.

Thus evaluating a polynomial of degree $N - 1 = 2^m - 1$ at all $N$ $N$th roots
of unity is the same as evaluating two polynomials of degree $\frac{N}{2} - 1$ at all $\frac{N}{2}$
$\frac{N}{2}$th roots of unity and then combining the results with $N$ multiplications and
additions. This can be done recursively.


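ALGORITHM 6.14. FFT (recursive version):

    function FFT($a$, $N$)
        if $N = 1$
            return $a$
        else
            $a'_{\rm even} = $ FFT($a_{\rm even}$, $N/2$)
            $a'_{\rm odd} = $ FFT($a_{\rm odd}$, $N/2$)
            $\omega = e^{-2\pi i/N}$
            $w = [\omega^0, \ldots, \omega^{N/2 - 1}]$
            return $a' = [a'_{\rm even} + w \,.\!*\, a'_{\rm odd},\; a'_{\rm even} - w \,.\!*\, a'_{\rm odd}]$
        endif

Here $.*$ means componentwise multiplication of arrays (as in Matlab), and we
have used the fact that $\omega^{j + N/2} = -\omega^j$.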
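
Algorithm 6.14 translates almost line for line into Python. The sketch below assumes the input length is a power of 2 and uses list slicing to pick out the even- and odd-indexed coefficients; the function names are made up for this illustration, and the naive quadratic-cost DFT is included only as a reference for comparison.

```python
import cmath


def fft(a):
    """Recursive radix-2 FFT following Algorithm 6.14; len(a) must be a power of 2."""
    N = len(a)
    if N == 1:
        return list(a)
    even = fft(a[0::2])                  # transform of a_0, a_2, a_4, ...
    odd = fft(a[1::2])                   # transform of a_1, a_3, a_5, ...
    omega = cmath.exp(-2j * cmath.pi / N)
    w = [omega ** j for j in range(N // 2)]
    top = [e + wj * o for e, wj, o in zip(even, w, odd)]
    bottom = [e - wj * o for e, wj, o in zip(even, w, odd)]   # omega^(j+N/2) = -omega^j
    return top + bottom


def naive_dft(a):
    """Reference: evaluate the defining sum directly in O(N^2) operations."""
    N = len(a)
    return [sum(a[col] * cmath.exp(-2j * cmath.pi * row * col / N) for col in range(N))
            for row in range(N)]


a = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0, -2.0, 1.5]
fast = fft(a)
slow = naive_dft(a)
print(max(abs(u - v) for u, v in zip(fast, slow)))
```

This version returns the components in normal increasing order, at the price of allocating new lists at every level of the recursion.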


Let the cost of this algorithm be denoted $C(N)$. Then we see that $C(N)$
satisfies the recurrence $C(N) = 2C(N/2) + 3N/2$ (assuming that the powers
of $\omega$ are precomputed and stored in tables). To solve this recurrence write

$$
C(N) = 2C\left( \frac{N}{2} \right) + \frac{3N}{2}
= 4C\left( \frac{N}{4} \right) + 2 \cdot \frac{3N}{2}
= 8C\left( \frac{N}{8} \right) + 3 \cdot \frac{3N}{2}
= \cdots = \log_2 N \cdot \frac{3N}{2}.
$$

To compute the FFT of each column (or each row) of an $N$-by-$N$ matrix
therefore costs $\log_2 N \cdot \frac{3N^2}{2}$. This complexity analysis justifies the entry for the
FFT in Table 6.1.
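
The solution of the recurrence is easy to confirm numerically. The Python sketch below evaluates $C(N) = 2C(N/2) + 3N/2$ directly for powers of 2 and compares it with $\frac{3N}{2} \log_2 N$; the base case $C(1) = 0$ is an assumption of this sketch (a length-1 transform does no arithmetic), and the function name is made up.

```python
def fft_cost(N):
    """C(N) = 2 C(N/2) + 3N/2 with C(1) = 0, for N a power of 2."""
    if N == 1:
        return 0
    return 2 * fft_cost(N // 2) + 3 * N // 2


for m in range(1, 11):
    N = 2 ** m
    closed_form = 3 * N * m // 2      # (3N/2) * log2(N), with log2(N) = m
    print(N, fft_cost(N), closed_form)
```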

In practice, implementations of the FFT use simple nested loops rather 
than recursion in order to be as efficient as possible; see NETLIB/fftpack. 
In addition, these implementations sometimes return the components in
bit-reversed order: This means that instead of returning
$y_0, y_1, \ldots, y_{N-1}$, where $y = \Phi x$, the subscripts $j$ are
reordered so that the bit patterns are reversed. For example, if $N = 8$, the
subscripts run from $0 = 000_2$ to $7 = 111_2$. The following table shows the
normal order and the bit-reversed order:

    normal increasing order    bit-reversed order
    0 = 000_2                  0 = 000_2
    1 = 001_2                  4 = 100_2
    2 = 010_2                  2 = 010_2
    3 = 011_2                  6 = 110_2
    4 = 100_2                  1 = 001_2
    5 = 101_2                  5 = 101_2
    6 = 110_2                  3 = 011_2
    7 = 111_2                  7 = 111_2


The inverse FFT undoes this reordering and returns the results in their 
original order. Therefore, these algorithms can be used for solving the model 
problem, provided that we divide by the appropriate eigenvalues, whose sub- 
scripts correspond to bit-reversed order. (Note that Matlab always returns 
results in normal increasing order.) 
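The bit-reversed ordering in the table can be generated mechanically. The following Python sketch (the function names are hypothetical, and $N$ is assumed to be a power of 2) reverses the $m$-bit pattern of each subscript.

```python
def bit_reverse(j, m):
    """Reverse the m-bit binary representation of the integer j."""
    r = 0
    for _ in range(m):
        r = (r << 1) | (j & 1)   # shift the lowest bit of j into r
        j >>= 1
    return r

def bit_reversed_order(n):
    """Subscripts 0..n-1 in bit-reversed order; n must be a power of 2."""
    m = n.bit_length() - 1       # n = 2**m
    return [bit_reverse(j, m) for j in range(n)]
```

For `n = 8` this reproduces the right-hand column of the table, and applying the permutation twice restores the normal increasing order.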


6.8. Block Cyclic Reduction 


Block cyclic reduction is another fast ($O(N^2 \log_2 N)$) method for the model
problem but is slightly more generally applicable than the FFT-based solution.
The fastest algorithms for the model problem on vector computers are often a 
hybrid of block cyclic reduction and FFT. 

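First we describe a simple but numerically unstable version of the
algorithm; then we say a little about how to stabilize it. Write the model
problem as

$$
\begin{bmatrix}
A & -I & & \\
-I & \ddots & \ddots & \\
 & \ddots & \ddots & -I \\
 & & -I & A
\end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ \vdots \\ x_N \end{bmatrix}
=
\begin{bmatrix} b_1 \\ \vdots \\ \vdots \\ b_N \end{bmatrix},
$$

where we assume that $N$, the dimension of $A = T_N + 2I_N$, is odd. Note also
that $x_i$ and $b_i$ are $N$-vectors.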


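We use block Gaussian elimination to combine three consecutive sets of
equations,

$$
\begin{array}{rl}
             & [\; -x_{j-2} + A x_{j-1} - x_j = b_{j-1} \;], \\
+\, A \times & [\; -x_{j-1} + A x_j - x_{j+1} = b_j \;], \\
+            & [\; -x_j + A x_{j+1} - x_{j+2} = b_{j+1} \;],
\end{array}
$$

thus eliminating $x_{j-1}$ and $x_{j+1}$:

$$ -x_{j-2} + (A^2 - 2I)\, x_j - x_{j+2} = b_{j-1} + A b_j + b_{j+1}. $$

Doing this for every set of three consecutive equations yields two sets of
equations: one for the $x_j$ with $j$ even,

$$
\begin{bmatrix}
B & -I & & \\
-I & B & \ddots & \\
 & \ddots & \ddots & -I \\
 & & -I & B
\end{bmatrix}
\begin{bmatrix} x_2 \\ x_4 \\ \vdots \\ x_{N-1} \end{bmatrix}
=
\begin{bmatrix}
b_1 + A b_2 + b_3 \\
b_3 + A b_4 + b_5 \\
\vdots \\
b_{N-2} + A b_{N-1} + b_N
\end{bmatrix}, \qquad (6.50)
$$

where $B = A^2 - 2I$, and one set of equations for the $x_j$ with $j$ odd,
which we can solve, after solving equation (6.50), for the odd $x_j$:

$$
\begin{bmatrix}
A & & & \\
 & A & & \\
 & & \ddots & \\
 & & & A
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
=
\begin{bmatrix}
b_1 + x_2 \\
b_3 + x_2 + x_4 \\
\vdots \\
b_N + x_{N-1}
\end{bmatrix}.
$$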


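Note that equation (6.50) has the same form as the original problem, so
we may repeat this process recursively. For example, at the next step we get

$$
\begin{bmatrix}
C & -I & & \\
-I & C & \ddots & \\
 & \ddots & \ddots & -I \\
 & & -I & C
\end{bmatrix}
\begin{bmatrix} x_4 \\ x_8 \\ \vdots \end{bmatrix}
=
\begin{bmatrix} \vdots \end{bmatrix},
\qquad \text{where } C = B^2 - 2I.
$$

We repeat this until only one equation is left, which we solve another way.
We formalize this algorithm as follows: Assume $N = N_0 = 2^{k+1} - 1$, and
let $N_r = 2^{k+1-r} - 1$. Let $A^{(0)} = A$ and $b_j^{(0)} = b_j$ for
$j = 1, \ldots, N$.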


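ALGORITHM 6.15. Block cyclic reduction:

    1) Reduce:
       for r = 0 to k - 1
           $A^{(r+1)} = (A^{(r)})^2 - 2I$
           for j = 1 to $N_{r+1}$
               $b_j^{(r+1)} = b_{2j-1}^{(r)} + A^{(r)} b_{2j}^{(r)} + b_{2j+1}^{(r)}$
           end for
       end for

       Comment: at the rth step the problem is reduced to

       $$
       \begin{bmatrix}
       A^{(r)} & -I & & \\
       -I & \ddots & \ddots & \\
        & \ddots & \ddots & -I \\
        & & -I & A^{(r)}
       \end{bmatrix}
       \begin{bmatrix} x_1^{(r)} \\ \vdots \\ \vdots \\ x_{N_r}^{(r)} \end{bmatrix}
       =
       \begin{bmatrix} b_1^{(r)} \\ \vdots \\ \vdots \\ b_{N_r}^{(r)} \end{bmatrix}
       $$

    2) $A^{(k)} x^{(k)} = b^{(k)}$ is solved another way.

    3) Backsolve:
       for r = k - 1, ..., 0
           for j = 1 to $N_{r+1}$
               $x_{2j}^{(r)} = x_j^{(r+1)}$
           end for
           for j = 1 to $N_r$ step 2
               solve $A^{(r)} x_j^{(r)} = b_j^{(r)} + x_{j-1}^{(r)} + x_{j+1}^{(r)}$ for $x_j^{(r)}$
               (we take $x_0^{(r)} = x_{N_r+1}^{(r)} = 0$)
           end for
       end for

Finally, $x = x^{(0)}$ is the desired result.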
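To see the reduce and backsolve steps in action without any matrix machinery, one can take the blocks to be 1-by-1, so that $A$ is a scalar $a$ and the system is tridiag$(-1, a, -1)\,x = b$. The Python sketch below is a simplification made here for illustration (it is the simple, unstabilized variant, written recursively, with the hypothetical name `cyclic_reduction`); the length of `b` is assumed to be $2^{k+1} - 1$.

```python
def cyclic_reduction(a, b):
    """Solve tridiag(-1, a, -1) x = b by cyclic reduction with 1-by-1 blocks.

    len(b) must equal 2**(k+1) - 1 for some k >= 0.
    """
    n = len(b)
    if n == 1:
        return [b[0] / a]                    # step 2: the last equation, solved directly
    m = (n - 1) // 2                         # number of even-numbered unknowns
    a_next = a * a - 2.0                     # A^(r+1) = (A^(r))^2 - 2I
    # b_j^(r+1) = b_{2j-1} + A b_{2j} + b_{2j+1}   (1-based subscripts)
    b_next = [b[2 * j - 2] + a * b[2 * j - 1] + b[2 * j] for j in range(1, m + 1)]
    x_even = cyclic_reduction(a_next, b_next)
    x = [0.0] * (n + 2)                      # padded so that x_0 = x_{n+1} = 0
    for j in range(1, m + 1):
        x[2 * j] = x_even[j - 1]             # x_{2j}^(r) = x_j^(r+1)
    for j in range(1, n + 1, 2):
        x[j] = (b[j - 1] + x[j - 1] + x[j + 1]) / a   # solve for the odd unknowns
    return x[1:n + 1]
```

Because `a` is squared at every level, the reduced coefficient grows very quickly, which is the scalar shadow of the instability discussed next.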


This simple approach has two drawbacks: 


1) It is numerically unstable because $A^{(r)}$ grows quickly:
$\|A^{(r)}\| \sim \|A^{(r-1)}\|^2 \sim 4^{2^r}$, so in computing
$b_j^{(r+1)}$ the $b_{2j\pm 1}^{(r)}$ are lost in roundoff.


2) $A^{(r)}$ has bandwidth $2^r + 1$ if $A$ is tridiagonal, so it soon becomes
dense and thus expensive to multiply or solve.


Here is a fix for the second drawback. Note that $A^{(r)}$ is a polynomial
$p_r(A)$ of degree $2^r$:

$$ p_0(A) = A \quad \text{and} \quad p_{r+1}(A) = (p_r(A))^2 - 2I. $$

LEMMA 6.10. Let $t = 2\cos\theta$. Then
$p_r(t) = p_r(2\cos\theta) = 2\cos(2^r\theta)$.

Proof. This is a simple trigonometric identity.

Note that $p_r(t) = 2\cos(2^r \arccos(\frac{t}{2})) = 2T_{2^r}(\frac{t}{2})$,
where $T_{2^r}(\cdot)$ is a Chebyshev polynomial (see section 6.5.6).

LEMMA 6.11. $p_r(t) = \prod_{j=1}^{2^r} (t - t_j)$, where
$t_j = 2\cos\left(\pi \frac{2j-1}{2^{r+1}}\right)$.

Proof. The zeros of the Chebyshev polynomials are given in Lemma 6.7.

Thus $A^{(r)} = \prod_{j=1}^{2^r} \left(A - 2\cos\left(\pi \frac{2j-1}{2^{r+1}}\right) I\right)$,
so solving $A^{(r)} z = c$ is equivalent to solving $2^r$ tridiagonal systems
with tridiagonal coefficient matrices
$A - 2\cos\left(\pi \frac{2j-1}{2^{r+1}}\right) I$, each of which costs $O(N)$
via tridiagonal Gaussian elimination or Cholesky.

More changes are needed to have a numerically stable algorithm. The final
algorithm is due to Buneman and described in [46, 45].
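Lemmas 6.10 and 6.11 are easy to check numerically for scalar arguments. The short Python sketch below (function names are hypothetical) evaluates $p_r(t)$ both by the recurrence $p_{r+1} = p_r^2 - 2$ and by the factored form with roots $t_j = 2\cos(\pi(2j-1)/2^{r+1})$.

```python
import math

def p_recurrence(r, t):
    """p_0(t) = t and p_{r+1}(t) = p_r(t)**2 - 2, evaluated at a scalar t."""
    for _ in range(r):
        t = t * t - 2.0
    return t

def p_factored(r, t):
    """Product form of Lemma 6.11: prod_{j=1}^{2^r} (t - t_j)."""
    prod = 1.0
    for j in range(1, 2 ** r + 1):
        t_j = 2.0 * math.cos(math.pi * (2 * j - 1) / 2 ** (r + 1))
        prod *= t - t_j
    return prod
```

With $t = 2\cos\theta$, `p_recurrence(r, t)` should agree with $2\cos(2^r\theta)$ up to roundoff, and with `p_factored(r, t)` for any $t$.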


We analyze the cost of the simple algorithm as follows; the stable algorithm
is analogous. Multiplying by a tridiagonal matrix or solving a tridiagonal
system of size $N$ costs $O(N)$ flops. Therefore multiplying by $A^{(r)}$ or
solving a system with $A^{(r)}$ costs $O(2^r N)$ flops, since $A^{(r)}$ is the
product of $2^r$ tridiagonal matrices. The inner loop of step 1) of the
algorithm therefore costs $\frac{N}{2^{r+1}} \cdot O(2^r N) = O(N^2)$ flops to
update the $N_{r+1} \approx \frac{N}{2^{r+1}}$ vectors $b_j^{(r+1)}$.
$A^{(r+1)}$ is not computed explicitly. Since the loop in step 1) is executed
$k \approx \log_2 N$ times, the total cost of step 1) is $O(N^2 \log_2 N)$.
For similar reasons, step 2) costs $O(2^k N) = O(N^2)$ flops, and step 3) costs
$O(N^2 \log_2 N)$ flops, for a total cost of $O(N^2 \log_2 N)$ flops. This
justifies the entry for block cyclic reduction in Table 6.1.

This algorithm generalizes to any block tridiagonal matrix with a sym- 
metric matrix A repeated along the diagonal and a symmetric matrix F that 
commutes with A (FA = AF) repeated along the offdiagonals. See also Ques- 
tion 6.10. This is a common situation when solving linear systems arising from 
discretized differential equations such as Poisson’s equation. 


6.9. Multigrid 


Multigrid methods were invented for partial differential equations such as Pois- 
son’s equation, but they work on a wider class of problems too. In contrast to 
other iterative schemes that we have discussed so far, multigrid’s convergence 
rate is independent of the problem size N, instead of slowing down for larger 
problems. As a consequence, it can solve problems with n unknowns in O(n) 
time or for a constant amount of work per unknown. This is optimal, modulo 
the (modest) constant hidden inside the O(-). 

Here is why the other iterative algorithms that we have discussed cannot 
be optimal for the model problem. In fact, this is true of any iterative
algorithm that computes approximation $x_{m+1}$ by averaging values of $x_m$
and the right-hand side $b$ from neighboring grid points. This includes
Jacobi's, Gauss-Seidel, SOR($\omega$), SSOR with Chebyshev acceleration (the
last three with red-black ordering), and any Krylov subspace method based on
matrix-vector multiplication with the matrix $T_{N \times N}$; this is because
multiplying a vector by $T_{N \times N}$ is also equivalent to averaging
neighboring grid point values. Suppose
that we start with a right-hand side b on a 31-by-31 grid, with a single nonzero 
entry, as shown in the upper left of Figure 6.9. The true solution $x$ is shown
in the upper right of the same figure; note that it is everywhere nonzero and
gets smaller as we get farther from the center. The bottom left plot in
Figure 6.9 shows the solution $x_{J,5}$ after 5 steps of Jacobi's method,
starting with an initial solution of all zeros. Note that the solution
$x_{J,5}$ is zero more than 5 grid points away from the center, because
averaging with neighboring grid points can “propagate information” only one
grid point per iteration, and the only nonzero value is initially in the center
of the grid. More generally, after $k$ iterations only grid points within $k$
of the center can be nonzero. The bottom right figure shows the best possible
solution $x_{Best,5}$ obtainable by any “nearest neighbor” method after 5
steps: it agrees with $x$ on grid points within 5 of the center and is
necessarily 0 farther away. We see graphically that the error
$x_{Best,5} - x$ is equal to the size of $x$ at the sixth grid point away from
the center. This is still a large error; by formalizing this argument, one can
show that it would take at least $O(\log n)$ steps on an $n$-by-$n$ grid to
decrease the error by a constant factor less than 1, no matter what
“nearest-neighbor” algorithm is used. If we want to do better than
$O(\log n)$ steps (and $O(n \log n)$ cost), we need to “propagate information”
farther than one grid point per iteration. Multigrid does this by communicating
with nearest neighbors on coarser grids, where a nearest neighbor on a coarse
grid can be much farther away than a nearest neighbor on a fine grid.

Fig. 6.9. Limits of averaging neighboring grid points. (Panel titles include
“Right Hand Side” and “True Solution.”)
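The “one grid point per iteration” limit is easy to observe experimentally. The Python sketch below runs Jacobi's method on a small two-dimensional model problem; it assumes the usual 5-point stencil (4 on the diagonal, $-1$ for each of the four neighbors) with zero boundary values, and the function name `jacobi_steps` is made up for this illustration.

```python
def jacobi_steps(b, steps):
    """Run `steps` Jacobi iterations from x = 0 for the 5-point model problem.

    b is an n-by-n list of lists (right-hand side); boundary values are zero.
    Each update is x_new[i][j] = (b[i][j] + sum of the 4 neighbors of x) / 4.
    """
    n = len(b)
    x = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        new = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                s = b[i][j]
                if i > 0:
                    s += x[i - 1][j]
                if i < n - 1:
                    s += x[i + 1][j]
                if j > 0:
                    s += x[i][j - 1]
                if j < n - 1:
                    s += x[i][j + 1]
                new[i][j] = s / 4.0
        x = new
    return x
```

Starting from a 31-by-31 right-hand side with a single nonzero entry at the center and taking 5 steps, every entry of the iterate farther than 5 grid points from the center (measured along grid lines) is still exactly zero.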


Multigrid uses coarse grids to do divide-and-conquer in two related senses.
First, it obtains an initial solution for an N-by-N grid by using an
(N/2)-by-(N/2) grid as an approximation, taking every other grid point from the
N-by-N grid. The coarser (N/2)-by-(N/2) grid is in turn approximated by an
(N/4)-by-(N/4) grid, and so on recursively. The second way multigrid uses
divide-and-conquer is in the frequency domain. This requires us to think of 
the error as a sum of eigenvectors, or sine-curves of different frequencies. Then 
the work that we do on a particular grid will eliminate the error in half of the 
frequency components not eliminated on other grids. In particular, the work 
performed on a particular grid—averaging the solution at each grid point with 
its neighbors, a variation of Jacobi’s method—makes the solution smoother, 
which is equivalent to getting rid of the high-frequency error. We will illustrate 
these notions further below. 


6.9.1. Overview of Multigrid on Two-Dimensional Poisson’s Equation


We begin by stating the algorithm at a high level and then fill in details. 
As with block cyclic reduction (section 6.8), it turns out to be convenient to
consider a $(2^k - 1)$-by-$(2^k - 1)$ grid of unknowns rather than the
$2^k$-by-$2^k$ grid favored by the FFT (section 6.7). For understanding and
implementation, it is convenient to add the nodes at the boundary, which have
the known value 0, to get a $(2^k + 1)$-by-$(2^k + 1)$ grid, as shown in
Figures 6.10 and 6.13. We also let $N_k = 2^k - 1$.

We will let $P^{(i)}$ denote the problem of solving a discrete Poisson equation
on a $(2^i + 1)$-by-$(2^i + 1)$ grid with $(2^i - 1)^2$ unknowns, or
equivalently a $(N_i + 2)$-by-$(N_i + 2)$ grid with $N_i^2$ unknowns. The
problem $P^{(i)}$ is specified by the right-hand side $b^{(i)}$ and implicitly
the grid size $2^i - 1$ and the coefficient matrix
$T^{(i)} \equiv T_{N_i \times N_i}$. An approximate solution of $P^{(i)}$ will
be denoted $x^{(i)}$. Thus, $b^{(i)}$ and $x^{(i)}$ are
$(2^i - 1)$-by-$(2^i - 1)$ arrays of values at each grid point. (The zero
boundary values are implicit.) We will generate a sequence of related problems
$P^{(i)}, P^{(i-1)}, P^{(i-2)}, \ldots, P^{(1)}$ on increasingly coarse grids,
where the solution to $P^{(i-1)}$ is a good approximation to the error in the
solution of $P^{(i)}$.

To explain how multigrid works, we need some operators that take a prob- 
lem on one grid and either improve it or transform it to a related problem on 
another grid: 


• The solution operator $S$ takes a problem $P^{(i)}$ and its approximate
  solution $x^{(i)}$ and computes an improved $x^{(i)}$:

      improved $x^{(i)} = S(b^{(i)}, x^{(i)})$.    (6.51)

  The improvement is to damp the “high-frequency components” of the
  error. We will explain what this means below. It is implemented by
  averaging each grid point value with its nearest neighbors and is a
  variation of Jacobi’s method.

• The restriction operator $R$ takes a right-hand side $b^{(i)}$ from problem
  $P^{(i)}$ and maps it to $b^{(i-1)}$, which is an approximation on the
  coarser grid:

      $b^{(i-1)} = R(b^{(i)})$.    (6.52)

  Its implementation also requires just a weighted average with nearest
  neighbors on the grid.

• The interpolation operator $In$ takes an approximate solution $x^{(i-1)}$
  for $P^{(i-1)}$ and converts it to an approximate solution $x^{(i)}$ for the
  problem $P^{(i)}$ on the next finer grid:

      $x^{(i)} = In(x^{(i-1)})$.    (6.53)

  Its implementation also requires just a weighted average with nearest
  neighbors on the grid.

Fig. 6.10. Sequence of grids used by two-dimensional multigrid.

Since all three operators are implemented by replacing values at each grid
point by some weighted averages of nearest neighbors, each operation costs
just $O(1)$ per unknown, or $O(n)$ for $n$ unknowns. This is the key to the low
cost of the ultimate algorithm.


Multigrid V-Cycle 


This is enough to state the basic algorithm, the multigrid V-cycle (MGV). 


ALGORITHM 6.16. MGV (the lines are numbered for later reference):

    function MGV($b^{(i)}$, $x^{(i)}$)    ... replace an approximate solution $x^{(i)}$
                                          ... of $P^{(i)}$ with an improved one
        if i = 1                          ... only one unknown
            compute the exact solution $x^{(1)}$ of $P^{(1)}$
            return $x^{(1)}$
        else
            1) $x^{(i)} = S(b^{(i)}, x^{(i)})$              ... improve the solution
            2) $r^{(i)} = T^{(i)} \cdot x^{(i)} - b^{(i)}$  ... compute the residual
            3) $d^{(i)} = In(MGV(4 \cdot R(r^{(i)}), 0))$   ... solve recursively
                                                            ... on coarser grids
            4) $x^{(i)} = x^{(i)} - d^{(i)}$                ... correct fine grid solution
            5) $x^{(i)} = S(b^{(i)}, x^{(i)})$              ... improve the solution again
            return $x^{(i)}$
        endif


In words, the algorithm does the following: 

1. Starts with a problem on a fine grid $(b^{(i)}, x^{(i)})$.

2. Improves it by damping the high-frequency error:
   $x^{(i)} = S(b^{(i)}, x^{(i)})$.

3. Computes the residual $r^{(i)}$ of the approximate solution $x^{(i)}$.

4. Approximates the fine grid residual $r^{(i)}$ on the next coarser grid:
   $R(r^{(i)})$.

5. Solves the coarser problem recursively, with a zero initial guess:
   $MGV(4 \cdot R(r^{(i)}), 0)$. The factor 4 appears because of the $h^2$
   factor in the right-hand side of Poisson’s equation, which changes by a
   factor of 4 from fine grid to coarse grid.

6. Maps the coarse solution back to the fine grid:
   $d^{(i)} = In(MGV(4 \cdot R(r^{(i)}), 0))$.

7. Subtracts the correction computed on the coarse grid from the fine grid
   solution: $x^{(i)} = x^{(i)} - d^{(i)}$.

8. Improves the solution some more: $x^{(i)} = S(b^{(i)}, x^{(i)})$.


We justify the algorithm briefly as follows (we do the details later). Suppose 
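(by induction) that $d^{(i)}$ is the exact solution to the equation

$$ T^{(i)} \cdot d^{(i)} = r^{(i)} = T^{(i)} \cdot x^{(i)} - b^{(i)}. $$

Rearranging, we get

$$ T^{(i)} \cdot (x^{(i)} - d^{(i)}) = b^{(i)} $$

so that $x^{(i)} - d^{(i)}$ is the desired solution.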

The algorithm is called a V-cycle, because if we draw it schematically in
(grid number $i$, time) space, with a point for each recursive call to MGV, it
looks like Figure 6.11, starting with a call to $MGV(b^{(5)}, x^{(5)})$ in the
upper left corner. This calls MGV on grid 4, then 3, and so on down to the
coarsest grid 1 and then back up to grid 5 again.

Knowing only that the building blocks $S$, $R$, and $In$ replace values at grid
points by certain weighted averages of their neighbors, we know enough to do
an $O(\cdot)$ complexity analysis of MGV. Since each building block does a
constant amount of work per grid point, it does a total amount of work
proportional to the number of grid points. Thus, each point at grid level $i$
on the “V” in the V-cycle will cost $O((2^i - 1)^2) = O(4^i)$ operations. If
the finest grid is at level $k$ with $n = O(4^k)$ unknowns, then the total cost
will be given by the geometric sum

$$ \sum_{i=1}^{k} O(4^i) = O(4^k) = O(n). $$

Fig. 6.11. MGV. (Axes: grid number $i$ versus time.)


Full Multigrid 


The ultimate multigrid algorithm uses the MGV just described as a building 
block. It is called full multigrid (FMG): 


ALGORITHM 6.17. FMG:

    function FMG($b^{(k)}$, $x^{(k)}$)    ... return an accurate solution $x^{(k)}$ of $P^{(k)}$
        solve $P^{(1)}$ exactly to get $x^{(1)}$
        for i = 2 to k
            $x^{(i)} = MGV(b^{(i)}, In(x^{(i-1)}))$
        end for


In words, the algorithm does the following: 
1. Solves the simplest problem $P^{(1)}$ exactly.

2. Given a solution $x^{(i-1)}$ of the coarse problem $P^{(i-1)}$, maps it to a
   starting guess $x^{(i)}$ for the next finer problem $P^{(i)}$:
   $In(x^{(i-1)})$.

3. Solves the finer problem using the MGV with this starting guess:
   $MGV(b^{(i)}, In(x^{(i-1)}))$.

Fig. 6.12. FMG. (Axes: grid number $i$ versus time.)


Now we can do the overall O(-) complexity analysis of FMG. A picture of 
FMG in (grid number i, time) space is shown in Figure 6.12. There is one 
“V” in this picture for each call to MGV in the inner loop of FMG. The “V” 
starting at level $i$ costs $O(4^i)$ as before. Thus the total cost is again
given by the geometric sum

$$ \sum_{i=1}^{k} O(4^i) = O(4^k) = O(n), $$

which is optimal, since it does a constant amount of work for each of the $n$
unknowns. This explains the entry for multigrid in Table 6.1.

A Matlab implementation of multigrid (both for the one and two-dimensional
model problems) is available at HOMEPAGE/Matlab/MG_README.html.


6.9.2. Detailed Description of Multigrid on One-Dimensional Poisson’s Equation


Now we will explain in detail the various operators S, R, and In composing 
the multigrid algorithm and sketch the convergence proof. We will do this for 
Poisson’s equation in one dimension, since this will capture all the relevant 
behavior but is simpler to write. In particular, we can now consider a nested 
set of one-dimensional problems instead of two-dimensional problems, as shown 
in Figure 6.13. 

As before we denote by $P^{(i)}$ the problem to be solved on grid $i$, namely,
$T^{(i)} \cdot x^{(i)} = b^{(i)}$, where as before $N_i = 2^i - 1$ and
$T^{(i)} \equiv T_{N_i}$. We begin by describing the solution operator $S$,
which is a form of weighted Jacobi convergence.


Solution Operator in One Dimension 


In this subsection we drop the superscripts on $T^{(i)}$, $x^{(i)}$, and
$b^{(i)}$ for simplicity of notation. Let $T = Z \Lambda Z^T$ be the
eigendecomposition of $T$, as defined in Lemma 6.1. The standard Jacobi’s
method for solving $Tx = b$ is $x_{m+1} = R x_m + c$, where $R = I - T/2$ and
$c = b/2$. We consider weighted Jacobi convergence
$x_{m+1} = R_\omega x_m + c_\omega$, where $R_\omega = I - \omega T/2$ and
$c_\omega = \omega b/2$; $\omega = 1$ corresponds to the standard Jacobi’s
method. Note that $R_\omega = Z(I - \omega\Lambda/2)Z^T$ is the
eigendecomposition of $R_\omega$. The eigenvalues of $R_\omega$ determine the
convergence of weighted Jacobi in the usual way: Let $e_m = x_m - x$ be the
error at the $m$th iteration of weighted Jacobi convergence so that

$$
\begin{aligned}
e_m &= R_\omega e_{m-1} \\
    &= R_\omega^m e_0 \\
    &= (Z(I - \omega\Lambda/2)Z^T)^m e_0 \\
    &= Z(I - \omega\Lambda/2)^m Z^T e_0
\end{aligned}
$$

so

$$
Z^T e_m = (I - \omega\Lambda/2)^m Z^T e_0
\quad \text{or} \quad
(Z^T e_m)_j = (I - \omega\Lambda/2)^m_{jj} (Z^T e_0)_j.
$$

Fig. 6.13. Sequence of grids used by one-dimensional multigrid. ($P^{(3)}$: 1D
grid of 9 points, 7 unknowns; $P^{(2)}$: 1D grid of 5 points, 3 unknowns;
$P^{(1)}$: 1D grid of 3 points, 1 unknown. On each grid, the labeled points are
part of the next coarser grid.)


We call $(Z^T e_m)_j$ the $j$th frequency component of the error $e_m$, since
$e_m = Z(Z^T e_m)$ is a sum of columns of $Z$ weighted by the $(Z^T e_m)_j$,
i.e., a sum of sinusoids of varying frequencies (see Figure 6.2). The
eigenvalues $\lambda_j(R_\omega) = 1 - \omega\lambda_j/2$ determine how fast
each frequency component goes to zero. Figure 6.14 plots $\lambda_j(R_\omega)$
for $N = 99$ and varying values of the weight $\omega$.

When $\omega = \frac{2}{3}$ and $j > \frac{N}{2}$, i.e., for the upper half of
the frequencies $\lambda_j$, we have $|\lambda_j(R_\omega)| \le \frac{1}{3}$.
This means that the upper half of the error components $(Z^T e_m)_j$ are
multiplied by $\frac{1}{3}$ or less at every iteration, independently of $N$.
Low-frequency error components are not decreased as much, as we will see in
Figure 6.15. So weighted Jacobi convergence with $\omega = \frac{2}{3}$ is good
at decreasing the high-frequency error.
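The bound on the upper half of the spectrum can be confirmed directly. The Python sketch below assumes the eigenvalues of $T_N$ from Lemma 6.1 are $\lambda_j = 2(1 - \cos(\pi j/(N+1)))$, so that $\lambda_j(R_\omega) = 1 - \omega(1 - \cos(\pi j/(N+1)))$; the function name is hypothetical.

```python
import math

def weighted_jacobi_eigs(n, w):
    """Eigenvalues lambda_j(R_w) = 1 - w*lambda_j/2 for j = 1..n.

    Assumes lambda_j = 2*(1 - cos(pi*j/(n+1))) are the eigenvalues of T_N.
    """
    return [1.0 - w * (1.0 - math.cos(math.pi * j / (n + 1))) for j in range(1, n + 1)]
```

For `n = 99` and `w = 2/3`, the entries with $j > N/2$ all have magnitude at most $1/3$, whereas for `w = 1` (standard Jacobi) the highest frequencies have eigenvalues close to $-1$ and are hardly damped at all.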

Thus, our solution operator $S$ in equation (6.51) consists of taking one step
of weighted Jacobi convergence with $\omega = \frac{2}{3}$:

$$ S(b, x) = R_{2/3} \cdot x + b/3. \qquad (6.54) $$

When we want to indicate the grid $i$ on which $R_{2/3}$ operates, we will
instead write $R_{2/3}^{(i)}$.
Fig. 6.14. Graph of the spectrum of $R_\omega$ for $N = 99$ and $\omega = 1$
(Jacobi’s method), $\omega = 1/2$ and $\omega = 2/3$.

Figure 6.15 shows the effect of taking two steps of $S$ for $i = 6$, where we
have $2^i - 1 = 63$ unknowns. There are three rows of pictures, the first row
showing the initial solution and error and the following two rows showing the
solution $x_m$ and error $e_m$ after successive applications of $S$. The true
solution is a sine curve, shown as a dotted line in the leftmost plot in each
row. The approximate solution is shown as a solid line in the same plot. The
middle plot shows the error alone, including its two-norm in the label at the
bottom. The rightmost plot shows the frequency components of the error
$Z^T e_m$. One
can see in the rightmost plots that as S is applied, the right (upper) half of the 
frequency components are damped out. This can also be seen in the middle and 
left plots, because the approximate solution grows smoother. This is because 
high-frequency error looks like “rough” error and low-frequency error looks like 
“smooth” error. Initially, the norm of the vector decreases rapidly, from 1.65 
to 1.055, but then decays more gradually, because there is little more error in 
the high frequencies to damp. Thus, it only makes sense to do a few iterations 
of S at a time. 


Recursive Structure of Multigrid 


Using this terminology, we can describe the recursive structure of multigrid as
follows. What multigrid does on the finest grid $P^{(k)}$ is to damp the upper
half of the frequency components of the error in the solution. This is
accomplished by the solution operator $S$, as just described. On the next
coarser grid, with half as many points, multigrid damps the upper half of the
remaining frequency components in the error. This is because taking a coarser
grid, with half as many points, makes frequencies appear twice as high, as
illustrated in the example below.

Fig. 6.15. Illustration of weighted Jacobi convergence.

Fig. 6.16. Schematic description of how multigrid damps error components.


EXAMPLE 6.16.

    N = 12, k = 4:  fine grid, low frequency ($k < \frac{N}{2}$),
                    $\sin\frac{\pi \cdot 4 \cdot j}{12}$ for $1 \le j \le 11$.

    N = 6, k = 4:   coarse grid, high frequency ($k > \frac{N}{2}$),
                    $\sin\frac{\pi \cdot 4 \cdot j}{6}$ for $1 \le j \le 5$.

On the next coarser grid, the upper half of the remaining frequency components
are damped, and so on, until we solve the exact (one unknown) problem
$P^{(1)}$. This is shown schematically in Figure 6.16. The purpose of the
restriction and interpolation operators is to change an approximate solution on
one grid to one on the next coarser or next finer grid.


Restriction Operator in One Dimension 


Now we turn to the restriction operator $R$, which takes a right-hand side
$r^{(i)}$ from problem $P^{(i)}$ and approximates it on the next coarser grid,
yielding $r^{(i-1)}$.

Fig. 6.17. Restriction from a grid with $2^4 - 1 = 15$ points to a grid with
$2^3 - 1 = 7$ points. (0 boundary values also shown.) Panels: Restriction by
Sampling; Restriction by Smoothing.

The simplest way to compute $r^{(i-1)}$ would be to simply sample $r^{(i)}$ at
the common grid points of the coarse and fine grids. But it is better to
compute $r^{(i-1)}$ at a coarse grid point by averaging values of $r^{(i)}$ on
neighboring fine grid points: the value at a coarse grid point is .5 times the
value at the corresponding fine grid point, plus .25 times each of the fine
grid point neighbors. We call this smoothing. Both methods are illustrated in
Figure 6.17.

So altogether, we write the restriction operation as

$$
r^{(i-1)} = R(r^{(i)}) \equiv P_i^{i-1} \cdot r^{(i)} =
\begin{bmatrix}
\frac14 & \frac12 & \frac14 & & & & \\
 & & \frac14 & \frac12 & \frac14 & & \\
 & & & & \ddots & & \\
 & & & & \frac14 & \frac12 & \frac14
\end{bmatrix}
\cdot r^{(i)}. \qquad (6.55)
$$

The subscript $i$ and superscript $i-1$ on the matrix $P_i^{i-1}$ indicate that
it maps from the grid with $2^i - 1$ points to the grid with $2^{i-1} - 1$
points.

In two dimensions, restriction involves averaging with the eight nearest
neighbors of each grid point: $\frac14$ times the grid cell value itself, plus
$\frac18$ times the four neighbors to the left, right, top, and bottom, plus
$\frac{1}{16}$ times the four remaining neighbors at the upper left, lower
left, upper right, and lower right.

Fig. 6.18. Interpolation from a grid with $2^3 - 1 = 7$ points to a grid with
$2^4 - 1 = 15$ points. (0 boundary values also shown.)

Interpolation Operator in One Dimension 


The interpolation operator $In$ takes an approximate solution $d^{(i-1)}$ on a
coarse grid and maps it to a function $d^{(i)}$ on the next finer grid. The
solution $d^{(i-1)}$ is interpolated to the finer grid as shown in Figure 6.18:
we do simple linear interpolation to fill in the values on the fine grid (using
the fact that the boundary values are known to be zero). Mathematically, we
write this as

$$
d^{(i)} = In(d^{(i-1)}) \equiv P_{i-1}^{i} \cdot d^{(i-1)} =
\begin{bmatrix}
\frac12 & & & \\
1 & & & \\
\frac12 & \frac12 & & \\
 & 1 & & \\
 & \frac12 & \ddots & \\
 & & \ddots & \frac12 \\
 & & & 1 \\
 & & & \frac12
\end{bmatrix}
\cdot d^{(i-1)}. \qquad (6.56)
$$

The subscript $i-1$ and superscript $i$ on the matrix $P_{i-1}^{i}$ indicate
that it maps from the grid with $2^{i-1} - 1$ points to the grid with $2^i - 1$
points.

Note that $P_{i-1}^{i} = 2 \cdot (P_i^{i-1})^T$. In other words, interpolation
and smoothing are essentially transposes of one another. This fact will be
important in the convergence analysis later.

In two dimensions, interpolation again involves averaging the values at
coarse nearest neighbors of a fine grid point (one neighbor if the fine grid
point is also a coarse grid point; two neighbors if the fine grid point’s
nearest coarse neighbors are to the left and right or top and bottom; and four
neighbors otherwise).
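The one-dimensional restriction and interpolation matrices described above are small enough to build explicitly. The following Python sketch uses dense lists of lists and hypothetical function names; it is meant only to make the index pattern concrete (coarse point $m$ sits at fine point $2m$ in 1-based numbering).

```python
def restriction_matrix(i):
    """P_i^{i-1}: (2**(i-1) - 1) rows by (2**i - 1) columns, rows (1/4, 1/2, 1/4)."""
    nf, nc = 2 ** i - 1, 2 ** (i - 1) - 1
    P = [[0.0] * nf for _ in range(nc)]
    for m in range(nc):
        c = 2 * m + 1                 # 0-based fine index of coarse point m
        P[m][c - 1], P[m][c], P[m][c + 1] = 0.25, 0.5, 0.25
    return P

def interpolation_matrix(i):
    """P_{i-1}^i: (2**i - 1) rows by (2**(i-1) - 1) columns, columns (1/2, 1, 1/2)."""
    nf, nc = 2 ** i - 1, 2 ** (i - 1) - 1
    Q = [[0.0] * nc for _ in range(nf)]
    for m in range(nc):
        c = 2 * m + 1
        Q[c - 1][m], Q[c][m], Q[c + 1][m] = 0.5, 1.0, 0.5
    return Q

def matvec(M, v):
    """Dense matrix-vector product for lists of lists."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in M]
```

Checking entry by entry that the interpolation matrix equals twice the transpose of the restriction matrix confirms the relation $P_{i-1}^{i} = 2 \cdot (P_i^{i-1})^T$ stated above.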


Putting It All Together 


Now we run the algorithm just described for eight iterations on the problem 
pictured in the top two plots of Figure 6.19; both the true solution x (on the 
top left) and right-hand side b (on the top right) are shown. The number 
of unknowns is $2^7 - 1 = 127$. We show how multigrid converges in the bottom
three plots. The middle left plot shows the ratio of consecutive residuals
$\|r_{m+1}\| / \|r_m\|$, where the subscript $m$ is the number of iterations of
multigrid (i.e., calls to FMG, or Algorithm 6.17). These ratios are about .15,
indicating that the residual decreases by more than a factor of 6 with each
multigrid iteration. This quick convergence is indicated in the middle right
plot, which shows a semilogarithmic plot of $\|r_m\|$ versus $m$; it is a
straight line with slope $\log_{10}(.15)$ as expected. Finally, the bottom plot
plots all eight error vectors $x_m - x$. We see how they smooth out and become
parallel on a semilogarithmic plot, with a constant decrease between adjacent
plots of $\log_{10}(.15)$.
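For readers who want to experiment, here is a compact Python sketch of one-dimensional multigrid V-cycles built from the operators described in this section. It is an illustration written for this text, not the Matlab code referred to earlier: the function names are made up, $T$ is taken to be the unscaled matrix tridiag$(-1, 2, -1)$, and the smoother is one step of weighted Jacobi with $\omega = 2/3$.

```python
import math

def apply_T(x):
    """Multiply by T_N = tridiag(-1, 2, -1), with zero boundary values."""
    n = len(x)
    return [2 * x[j] - (x[j - 1] if j > 0 else 0.0) - (x[j + 1] if j < n - 1 else 0.0)
            for j in range(n)]

def smooth(b, x):
    """S(b, x) = R_{2/3} x + b/3 = x - (T x - b)/3: one weighted Jacobi step."""
    Tx = apply_T(x)
    return [x[j] - (Tx[j] - b[j]) / 3.0 for j in range(len(x))]

def restrict(r):
    """Weights (1/4, 1/2, 1/4) centered on every second fine grid point."""
    return [0.25 * r[2 * m] + 0.5 * r[2 * m + 1] + 0.25 * r[2 * m + 2]
            for m in range((len(r) - 1) // 2)]

def interpolate(d):
    """Linear interpolation to the next finer grid (zero boundary values)."""
    p = [0.0] + list(d) + [0.0]
    fine = []
    for m in range(len(d)):
        fine.append(0.5 * (p[m] + p[m + 1]))   # midpoint left of coarse point m
        fine.append(p[m + 1])                  # the coarse point itself
    fine.append(0.5 * (p[-2] + p[-1]))         # last midpoint, next to the boundary
    return fine

def mgv(b, x):
    """One V-cycle in the style of Algorithm 6.16; len(b) must be 2**i - 1."""
    if len(b) == 1:
        return [b[0] / 2.0]                          # exact solve of 2x = b
    x = smooth(b, x)                                 # line 1
    Tx = apply_T(x)
    r = [Tx[j] - b[j] for j in range(len(b))]        # line 2
    rc = [4.0 * v for v in restrict(r)]
    d = interpolate(mgv(rc, [0.0] * len(rc)))        # line 3
    x = [x[j] - d[j] for j in range(len(x))]         # line 4
    return smooth(b, x)                              # line 5

def residual_norm(b, x):
    """Two-norm of T x - b."""
    Tx = apply_T(x)
    return math.sqrt(sum((Tx[j] - b[j]) ** 2 for j in range(len(b))))
```

Calling `mgv` repeatedly on a problem with 127 unknowns and printing `residual_norm` after each call gives a quick way to observe the roughly constant reduction per cycle described above; the exact ratios depend on the right-hand side chosen.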


Figure 6.20 shows a similar example for a two-dimensional model problem. 


Convergence Proof 


Finally, we sketch a convergence proof that shows that the overall error in an
FMG “V”-cycle is decreased by a constant less than 1, independent of grid size
$N_k = 2^k - 1$. This means that the number of FMG V-cycles needed to decrease
the error by any factor less than 1 is independent of $k$, and so the total
work is proportional to the cost of a single FMG V-cycle, i.e., proportional to
the number of unknowns $n$.


We will simplify the proof by looking at one V-cycle and assuming by 
induction that the coarse grid problem is solved exactly [42]. In reality, the 
coarse grid problem is not solved quite exactly, but this rough analysis suffices 
to capture the spirit of the proof: that low-frequency error is eliminated on 
the coarser grid and high-frequency error is eliminated on the fine grid. 


Now let us write all the formulas defining a V-cycle and combine them all to
get a single formula of the form “new $e^{(i)} = M \cdot e^{(i)}$,” where
$e^{(i)} = x^{(i)} - x$ is the error and $M$ is a matrix whose eigenvalues
determine the rate of convergence; our goal is to show that they are bounded
away from 1, independently of $i$. The line numbers in the following table
refer to Algorithm 6.16.

Fig. 6.19. Multigrid solution of one-dimensional model problem.

Fig. 6.20. Multigrid solution of two-dimensional model problem.

    (a) $x^{(i)} = S(b^{(i)}, x^{(i)}) = R_{2/3}^{(i)} x^{(i)} + b^{(i)}/3$
            by line 1) and equation (6.54),
    (b) $r^{(i)} = T^{(i)} \cdot x^{(i)} - b^{(i)}$
            by line 2),
    (c) $d^{(i)} = In(MGV(4 \cdot R(r^{(i)}), 0))$
            by line 3)
        $= In([T^{(i-1)}]^{-1} (4 \cdot R(r^{(i)})))$
            by our assumption that the coarse grid problem is solved exactly
        $= In([T^{(i-1)}]^{-1} (4 \cdot P_i^{i-1} r^{(i)}))$
            by equation (6.55)
        $= P_{i-1}^{i} ([T^{(i-1)}]^{-1} (4 \cdot P_i^{i-1} r^{(i)}))$
            by equation (6.56),
    (d) $x^{(i)} = x^{(i)} - d^{(i)}$
            by line 4),
    (e) $x^{(i)} = S(b^{(i)}, x^{(i)}) = R_{2/3}^{(i)} x^{(i)} + b^{(i)}/3$
            by line 5).


In order to get equations updating the error $e^{(i)}$, we subtract the
identity $x = R_{2/3}^{(i)} x + b^{(i)}/3$ from lines (a) and (e) above,
$0 = T^{(i)} \cdot x - b^{(i)}$ from line (b), and $x = x$ from line (d) to get

    (a) $e^{(i)} = R_{2/3}^{(i)} e^{(i)}$,
    (b) $r^{(i)} = T^{(i)} \cdot e^{(i)}$,
    (c) $d^{(i)} = P_{i-1}^{i} ([T^{(i-1)}]^{-1} (4 \cdot P_i^{i-1} r^{(i)}))$,
    (d) $e^{(i)} = e^{(i)} - d^{(i)}$,
    (e) $e^{(i)} = R_{2/3}^{(i)} e^{(i)}$.


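Substituting each of the above equations into the next yields the following
formula, showing how the error is updated by a V-cycle:

$$
\text{new } e^{(i)} = R_{2/3}^{(i)} \left\{ I - P_{i-1}^{i} [T^{(i-1)}]^{-1}
\left( 4 \cdot P_i^{i-1} T^{(i)} \right) \right\} R_{2/3}^{(i)} \cdot e^{(i)}
\equiv M \cdot e^{(i)}. \qquad (6.57)
$$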


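Now we need to compute the eigenvalues of $M$. We first simplify
equation (6.57), using the facts that $P_{i-1}^{i} = 2 \cdot (P_i^{i-1})^T$ and

$$
T^{(i-1)} = 4 \cdot P_i^{i-1} T^{(i)} P_{i-1}^{i}
          = 8 \cdot P_i^{i-1} T^{(i)} (P_i^{i-1})^T \qquad (6.58)
$$

(see Question 6.15). Substituting these into the expression for $M$ in
equation (6.57) yields

$$
M = R_{2/3}^{(i)} \left\{ I - (P_i^{i-1})^T
\left[ P_i^{i-1} T^{(i)} (P_i^{i-1})^T \right]^{-1}
\left( P_i^{i-1} T^{(i)} \right) \right\} R_{2/3}^{(i)}
$$

or, dropping indices to simplify notation,

$$ M = R_{2/3} \left\{ I - P^T \cdot [P T P^T]^{-1} \cdot P T \right\} R_{2/3}. \qquad (6.59) $$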


We continue, using the fact that all the matrices composing $M$ ($T$,
$R_{2/3}$, and $P$) can be (nearly) diagonalized by the eigenvector matrices
$Z = Z^{(i)}$ and $Z^{(i-1)}$ of $T = T^{(i)}$ and $T^{(i-1)}$, respectively:
Recall that $Z = Z^T = Z^{-1}$, $T = Z \Lambda Z$, and
$R_{2/3} = Z(I - \Lambda/3)Z \equiv Z \Lambda_R Z$. We leave it to the reader
to confirm that $Z^{(i-1)} P Z^{(i)} = \Lambda_P$, where $\Lambda_P$ is almost
diagonal (see Question 6.15):

$$
\Lambda_{P,jk} =
\begin{cases}
(+1 + \cos\frac{\pi j}{2^i})/\sqrt{8} & \text{if } k = j, \\
(-1 + \cos\frac{\pi j}{2^i})/\sqrt{8} & \text{if } k = 2^i - j, \\
0 & \text{otherwise.}
\end{cases}
\qquad (6.60)
$$

This lets us write

$$
\begin{aligned}
ZMZ &= (Z R_{2/3} Z) \cdot \Big\{ I - (Z P^T Z^{(i-1)}) \cdot
       \big[ (Z^{(i-1)} P Z)(Z T Z)(Z P^T Z^{(i-1)}) \big]^{-1} \cdot
       (Z^{(i-1)} P Z)(Z T Z) \Big\} \cdot (Z R_{2/3} Z) \\
    &\equiv \Lambda_R \cdot \big\{ I - \Lambda_P^T
       [\Lambda_P \Lambda \Lambda_P^T]^{-1} \Lambda_P \Lambda \big\} \cdot \Lambda_R.
\end{aligned}
$$

The matrix $ZMZ$ is similar to $M$ since $Z = Z^{-1}$ and so has the same
eigenvalues as $M$. Also, $ZMZ$ is nearly diagonal: it has nonzeros only on its
main diagonal and “perdiagonal” (the diagonal from the lower left corner to the
upper right corner of the matrix). This lets us compute the eigenvalues of $M$
explicitly.


THEOREM 6.11. The matrix M has eigenvalues 1/9 and 0, independent of 
i. Therefore multigrid converges at a fixed rate independent of the number of 
unknowns. 


For a proof, see Question 6.15. For a more general analysis, see [266]. 
For an implementation of this algorithm, see Question 6.16. The Web site 
[89] contains pointers to an extensive literature, software, and so on. 


6.10. Domain Decomposition 


Domain decomposition for solving sparse systems of linear equations is a topic 
of current research. See [48, 114, 203] and especially [230] for recent surveys. 
We will give only simple examples. 

The need for methods beyond those we have discussed arises from the
irregularity and size of real problems and also from the need for algorithms 
for parallel computers. The fastest methods that we have discussed so far, 
those based on block cyclic reduction, the FFT, and multigrid, work best 
(or only) on particularly regular problems such as the model problem, i.e., 
Poisson’s equation discretized with a uniform grid on a rectangle. But the 
region of solution of a real problem may not be a rectangle but more irregular, 
representing a physical object like a wing (see Figure 2.12). Figure 2.12 also 
illustrates that there may be more grid points in regions where the solution is 
expected to be less smooth than in regions with a smooth solution. Also, we 
may have more complicated equations than Poisson’s equation or even different 
equations in different regions. Independent of whether the problem is regular, 
it may be too large to fit in the computer memory and may have to be solved 
“in pieces.” Or we may want to break the problem into pieces that can be 
solved in parallel on a parallel computer. 

Domain decomposition addresses all these issues by showing how to sys- 
tematically create “hybrid” methods from the simpler methods discussed in 
previous sections. These simpler methods are applied to smaller and more reg- 
ular subproblems of the overall problem, after which these partial solutions are 
“pieced together” to get the overall solution. These subproblems can be solved 
one at a time if the whole problem does not fit into memory, or in parallel on 
a parallel computer. We give examples below. There are generally many ways 
to break a large problem into pieces, many ways to solve the individual pieces, 
and many ways to piece the solutions together. Domain decomposition theory 
does not provide a magic way to choose the best way to do this in all cases 
but rather a set of reasonable possibilities to try. There are some cases (such 
as problems sufficiently like Poisson’s equation) where the theory does yield 
“optimal methods” (costing O(1) work per unknown). 


We divide our discussion into two parts, nonoverlapping methods and over- 
lapping methods. 


6.10.1. Nonoverlapping Methods 


This method is also called substructuring or a Schur complement method in the 
literature. It has been used for decades, especially in the structural analysis 
community, to break large problems into smaller ones that fit into computer 
memory. 


For simplicity we will illustrate this method using the usual Poisson’s equa- 
tion with Dirichlet boundary conditions discretized with a 5-point stencil but 
on an L-shaped region rather than a square. This region may be decomposed 
into two domains: a small square and a large square of twice the side length, 
where the small square is connected to the bottom of the right side of the large 
square. We will design a solver that can exploit our ability to solve problems 
quickly on squares. 


In the figure below, the number of each grid point is shown for a coarse 
discretization (the number is above and to the left of the corresponding grid 
point; only grid points interior to the “L” are numbered). 


[Figure: the L-shaped region with its interior grid points numbered 1 through 31.]


Note that we have numbered first the grid points inside the two subdomains 
(1 to 4 and 5 to 29) and then the grid points on the boundary (30 and 31). 
The resulting matrix is 

[Display: the 31-by-31 matrix, with 4 in each diagonal entry and -1 in each 
off-diagonal entry coupling two adjacent grid points, partitioned after rows 
and columns 4 and 29.] 

\[
\equiv A = \begin{bmatrix}
A_{11} & 0 & A_{13} \\
0 & A_{22} & A_{23} \\
A_{13}^T & A_{23}^T & A_{33}
\end{bmatrix}.
\]


Here, $A_{11} = T_{2\times 2}$, $A_{22} = T_{5\times 5}$, and 
$A_{33} = T_{2\times 1} = T_2 + 2I_2$, where $T_N$ is defined in equation (6.3) 
and $T_{N\times N}$ is defined in equation (6.14). One of the 
most important properties of this matrix is that $A_{12} = 0$, since there is no 
direct coupling between the interior grid points of the two subdomains. The 
only coupling is through the boundary, which is numbered last (grid points 30 
and 31). Thus $A_{13}$ contains the coupling between the small square and the 
boundary, and $A_{23}$ contains the coupling between the large square and the 
boundary. 

To see how to take advantage of the special structure of A to solve Ax = b, 
write the block LU decomposition of A as follows: 


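\[
A = \begin{bmatrix}
I & 0 & 0 \\
0 & I & 0 \\
A_{13}^T A_{11}^{-1} & A_{23}^T A_{22}^{-1} & I
\end{bmatrix}
\cdot
\begin{bmatrix}
I & 0 & 0 \\
0 & I & 0 \\
0 & 0 & S
\end{bmatrix}
\cdot
\begin{bmatrix}
A_{11} & 0 & A_{13} \\
0 & A_{22} & A_{23} \\
0 & 0 & I
\end{bmatrix},
\]

where 

\[
S = A_{33} - A_{13}^T A_{11}^{-1} A_{13} - A_{23}^T A_{22}^{-1} A_{23} \tag{6.61}
\]

is called the Schur complement of the leading principal submatrix containing 
$A_{11}$ and $A_{22}$. Therefore, we may write 

\[
A^{-1} = \begin{bmatrix}
A_{11}^{-1} & 0 & -A_{11}^{-1} A_{13} \\
0 & A_{22}^{-1} & -A_{22}^{-1} A_{23} \\
0 & 0 & I
\end{bmatrix}
\cdot
\begin{bmatrix}
I & 0 & 0 \\
0 & I & 0 \\
0 & 0 & S^{-1}
\end{bmatrix}
\cdot
\begin{bmatrix}
I & 0 & 0 \\
0 & I & 0 \\
-A_{13}^T A_{11}^{-1} & -A_{23}^T A_{22}^{-1} & I
\end{bmatrix}.
\]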


Therefore, to multiply a vector by $A^{-1}$ we need to multiply by the blocks in 
the entries of this factored form of $A^{-1}$, namely, $A_{13}$ and $A_{23}$ (and 
their transposes), $A_{11}^{-1}$ and $A_{22}^{-1}$, and $S^{-1}$. Multiplying by 
$A_{13}$ and $A_{23}$ is cheap because they are very sparse. Multiplying by 
$A_{11}^{-1}$ and $A_{22}^{-1}$ is also cheap because we 
chose these subdomains to be solvable by FFT, block cyclic reduction, multigrid, 
or some other fast method discussed so far. It remains to explain how to 
multiply by $S^{-1}$. 

Since there are many fewer grid points on the boundary than in the subdomains, 
$A_{33}$ and $S$ have a much smaller dimension than $A_{11}$ and $A_{22}$; this effect 
grows for finer grid spacings. $S$ is symmetric positive definite, as is $A$, and (in 
this case) dense. To compute it explicitly one would need to solve with each 
subdomain once per boundary grid point (from the $A_{11}^{-1} A_{13}$ and 
$A_{22}^{-1} A_{23}$ terms in (6.61)). This can certainly be done, after which one 
could factor $S$ using dense Cholesky and proceed to solve the system. But this is 
expensive, much more so than just multiplying a vector by $S$, which requires just 
one solve per subdomain using equation (6.61). This makes a Krylov subspace-based 
iterative method such as CG look attractive (section 6.6), since these methods 
require only multiplying a vector by $S$. The number of matrix-vector 
multiplications CG requires depends on the condition number of $S$. What makes 
domain decomposition so attractive is that $S$ turns out to be much better 
conditioned than the original matrix $A$ (a condition number that grows like $O(N)$ 
instead of $O(N^2)$), and so convergence is fast [114, 203]. 
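The following is a minimal sketch of the approach just described, not the book's implementation. It assumes a small 3-by-7 grid of unknowns split by one interface column into two 3-by-3 subdomains, and it uses a dense Gaussian elimination routine as a stand-in for the fast subdomain solvers. All names (`solve`, `schur_matvec`, `cg`) are hypothetical. The point it illustrates is that CG needs only products with S, each costing one solve per subdomain.

```python
def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting (small dense M)."""
    n = len(M)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][j] * y[j] for j in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y

def matvec(M, v):
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def block(A, rows, cols):
    return [[A[i][j] for j in cols] for i in rows]

# 5-point Poisson matrix on a 3-by-7 grid of unknowns (Dirichlet boundary).
ROWS, COLS = 3, 7
n = ROWS * COLS
def idx(r, c):
    return r * COLS + c
A = [[0.0] * n for _ in range(n)]
for r in range(ROWS):
    for c in range(COLS):
        A[idx(r, c)][idx(r, c)] = 4.0
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < ROWS and 0 <= cc < COLS:
                A[idx(r, c)][idx(rr, cc)] = -1.0

# Column 3 is the interface; columns 0-2 and 4-6 are the two subdomains.
d1 = [idx(r, c) for r in range(ROWS) for c in range(0, 3)]
d2 = [idx(r, c) for r in range(ROWS) for c in range(4, 7)]
bd = [idx(r, 3) for r in range(ROWS)]
A11, A22, A33 = block(A, d1, d1), block(A, d2, d2), block(A, bd, bd)
A13, A23 = block(A, d1, bd), block(A, d2, bd)
A31, A32 = block(A, bd, d1), block(A, bd, d2)   # transposes of A13 and A23

def schur_matvec(v):
    """S v = A33 v - A13^T A11^{-1} (A13 v) - A23^T A22^{-1} (A23 v)."""
    t1 = matvec(A31, solve(A11, matvec(A13, v)))   # one solve on subdomain 1
    t2 = matvec(A32, solve(A22, matvec(A23, v)))   # one solve on subdomain 2
    return [a - u - w for a, u, w in zip(matvec(A33, v), t1, t2)]

def cg(apply_S, g, steps):
    """Plain conjugate gradients for S x = g, using only products with S."""
    x = [0.0] * len(g)
    r = g[:]
    p = r[:]
    rr = sum(t * t for t in r)
    for _ in range(steps):
        if rr < 1e-28:
            break
        Sp = apply_S(p)
        a = rr / sum(s * t for s, t in zip(p, Sp))
        x = [xi + a * pi for xi, pi in zip(x, p)]
        r = [ri - a * si for ri, si in zip(r, Sp)]
        rr_new = sum(t * t for t in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

b = [float(i % 5 + 1) for i in range(n)]
b1, b2, b3 = [b[i] for i in d1], [b[i] for i in d2], [b[i] for i in bd]
m1 = matvec(A31, solve(A11, b1))
m2 = matvec(A32, solve(A22, b2))
g = [t - u - w for t, u, w in zip(b3, m1, m2)]      # right-hand side for S
x3 = cg(schur_matvec, g, steps=2 * len(bd))          # boundary unknowns
x1 = solve(A11, [u - w for u, w in zip(b1, matvec(A13, x3))])
x2 = solve(A22, [u - w for u, w in zip(b2, matvec(A23, x3))])

x = [0.0] * n
for part, ids in ((x1, d1), (x2, d2), (x3, bd)):
    for val, i in zip(part, ids):
        x[i] = val
residual = max(abs(u - w) for u, w in zip(matvec(A, x), b))
```

In a realistic setting the calls to `solve` on `A11` and `A22` would be replaced by an FFT-based or multigrid solver, and S would never be formed explicitly.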

More generally, one has k > 2 subdomains, separated by boundaries (see 
Figure 6.21, where the heavy lines separate subdomains). If we number the 
nodes in each subdomain consecutively, followed by the boundary nodes, we 
get the matrix 


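\[
A = \begin{bmatrix}
A_{1,1} & & 0 & A_{1,k+1} \\
 & \ddots & & \vdots \\
0 & & A_{k,k} & A_{k,k+1} \\
A_{1,k+1}^T & \cdots & A_{k,k+1}^T & A_{k+1,k+1}
\end{bmatrix}, \tag{6.62}
\]

where again we can factor it by factoring each $A_{i,i}$ independently and forming 
the Schur complement 
$S = A_{k+1,k+1} - \sum_{i=1}^{k} A_{i,k+1}^T A_{i,i}^{-1} A_{i,k+1}$. 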

In this case, when there is more than one boundary segment, $S$ has further 
structure that can be exploited to precondition it. For example, by numbering 
the grid points in the interior of each boundary segment before the grid points 
at the intersection of boundary segments, one gets a block structure as in $A$. 
The diagonal blocks of $S$ are complicated but may be approximated by $T_N^{1/2}$, 
which may be inverted efficiently using the FFT [35, 36, 37, 38, 39]. To 
summarize the state of the art, by choosing the preconditioner for $S$ appropriately, 
one can make the number of steps of conjugate gradient independent of the 
number of boundary grid points $N$ [229]. 


6.10.2. Overlapping Methods 


The methods in the last section were called nonoverlapping because the domains 
corresponding to the nodes in $A_{i,i}$ were disjoint, leading to the block 
diagonal structure in equation (6.62). In this section we permit overlapping 
domains, as shown in the figure below. As we will see, this overlap permits us 
to design an algorithm comparable in speed with multigrid but applicable to 
a wider set of problems. 

The rectangle with a dashed boundary in the figure is domain $\Omega_1$, and the 
square with a solid boundary is domain $\Omega_2$. We have renumbered the nodes 
so that the nodes in $\Omega_1$ are numbered first and the nodes in $\Omega_2$ are 
numbered last, with the nodes in the overlap $\Omega_1 \cap \Omega_2$ in the middle. 


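[Figure: the L-shaped region with renumbered grid points. $\Omega_1$ is the 
dashed rectangle containing nodes 1 through 10, and $\Omega_2$ is the solid 
square containing nodes 7 through 31.] 

    31  26  21  16  13 
    30  25  20  15  12 
    29  24  19  14  11 
    28  23  18  10   8   6   4   2 
    27  22  17   9   7   5   3   1 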


These domains are shown in the matrix A below, which is the same matrix 
as in section 6.10.1 but with its rows and columns ordered as shown above: 

[Display: the 31-by-31 matrix $A$ in the new ordering, with 4 in each diagonal 
entry and -1 in each off-diagonal entry coupling two adjacent grid points. 
Single lines separate rows and columns 1 through 10 from 11 through 31, and 
double lines separate rows and columns 1 through 6 from 7 through 31.] 


We have indicated the boundaries between domains in the way that we 
have partitioned the matrix: The single lines divide the matrix into the nodes 
associated with $\Omega_1$ (1 through 10) and the rest $\Omega \setminus \Omega_1$ 
(11 through 31). The double lines divide the matrix into the nodes associated 
with $\Omega_2$ (7 through 31) and the rest $\Omega \setminus \Omega_2$ (1 through 6). 
The submatrices below are subscripted accordingly: 


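\[
A = \begin{bmatrix}
A_{\Omega_1,\Omega_1} & A_{\Omega_1,\Omega\setminus\Omega_1} \\
A_{\Omega\setminus\Omega_1,\Omega_1} & A_{\Omega\setminus\Omega_1,\Omega\setminus\Omega_1}
\end{bmatrix}
= \begin{bmatrix}
A_{\Omega\setminus\Omega_2,\Omega\setminus\Omega_2} & A_{\Omega\setminus\Omega_2,\Omega_2} \\
A_{\Omega_2,\Omega\setminus\Omega_2} & A_{\Omega_2,\Omega_2}
\end{bmatrix}.
\]

We conformally partition vectors such as 

\[
x = \begin{bmatrix} x_{\Omega_1} \\ x_{\Omega\setminus\Omega_1} \end{bmatrix}
= \begin{bmatrix} x(1:10) \\ x(11:31) \end{bmatrix}
= \begin{bmatrix} x_{\Omega\setminus\Omega_2} \\ x_{\Omega_2} \end{bmatrix}
= \begin{bmatrix} x(1:6) \\ x(7:31) \end{bmatrix}.
\]
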
Now we have enough notation to state two basic overlapping domain decom- 
position algorithms. The simplest one is called the additive Schwarz method for 


historical reasons but could as well be called overlapping block Jacobi iteration 
because of its similarity to (block) Jacobi iteration from sections 6.5 and 6.6.5. 

ALGORITHM 6.18. Additive Schwarz method for updating an approximate 
solution $x_i$ of $Ax = b$ to get a better solution $x_{i+1}$: 

\[
\begin{array}{ll}
r = b - A x_i & \text{/* compute the residual */} \\
x_{i+1} = x_i & \\
x_{i+1,\Omega_1} = x_{i,\Omega_1} + A_{\Omega_1,\Omega_1}^{-1} r_{\Omega_1} & \text{/* update the solution on } \Omega_1 \text{ */} \\
x_{i+1,\Omega_2} = x_{i+1,\Omega_2} + A_{\Omega_2,\Omega_2}^{-1} r_{\Omega_2} & \text{/* update the solution on } \Omega_2 \text{ */}
\end{array}
\]


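This algorithm can also be written in one line as 

\[
x_{i+1} = x_i
+ \begin{bmatrix} A_{\Omega_1,\Omega_1}^{-1} r_{\Omega_1} \\ 0 \end{bmatrix}
+ \begin{bmatrix} 0 \\ A_{\Omega_2,\Omega_2}^{-1} r_{\Omega_2} \end{bmatrix}.
\]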


In words, the algorithm works as follows: The update 
$A_{\Omega_1,\Omega_1}^{-1} r_{\Omega_1}$ corresponds 
to solving Poisson’s equation just on $\Omega_1$, using boundary conditions at nodes 
11, 14, 17, 18, and 19, which depend on the previous approximate solution $x_i$. 
The update $A_{\Omega_2,\Omega_2}^{-1} r_{\Omega_2}$ is analogous, using boundary 
conditions at nodes 5 and 6 depending on $x_i$. 

In our case the $\Omega_i$ are rectangles, so any one of our earlier fast methods, 
such as multigrid, could be used to compute 
$A_{\Omega_i,\Omega_i}^{-1} r_{\Omega_i}$. Since the additive Schwarz 
method is iterative, it is not necessary to solve the problems on $\Omega_i$ exactly. 

Indeed, the additive Schwarz method is typically used as a preconditioner 
for a Krylov subspace method like conjugate gradients (see section 6.6.5). In 
the notation of section 6.6.5, the preconditioner M is given by 


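\[
M^{-1} = \begin{bmatrix} A_{\Omega_1,\Omega_1}^{-1} & 0 \\ 0 & 0 \end{bmatrix}
+ \begin{bmatrix} 0 & 0 \\ 0 & A_{\Omega_2,\Omega_2}^{-1} \end{bmatrix}.
\]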


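If $\Omega_1$ and $\Omega_2$ did not overlap, then $M^{-1}$ would simplify to 

\[
\begin{bmatrix} A_{\Omega_1,\Omega_1}^{-1} & 0 \\ 0 & A_{\Omega_2,\Omega_2}^{-1} \end{bmatrix}
\]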


and we would be doing block Jacobi iteration. But we know that Jacobi’s 
method does not converge particularly quickly, because “information” about 
the solution from one domain can only move slowly to the other domain across 
the boundary between them (see the discussion at the beginning of section 6.9). 
But as long as the overlap is a large enough fraction of the two domains, infor- 
mation will travel quickly enough to guarantee fast convergence. Of course we 
do not want too large an overlap, because this increases the work significantly. 
The goal in designing a good domain decomposition method is to choose the 
domains and the overlaps so as to have fast convergence while doing as little 
work as possible; we say more on how convergence depends on overlap below. 
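As a rough illustration of how the additive Schwarz preconditioner acts on a vector, the sketch below applies it to a one-dimensional Poisson problem with two overlapping index ranges. This is an assumption-laden toy, not the L-shaped example: the matrix is tridiag(-1, 2, -1), each subdomain is a contiguous index range (so its submatrix is again tridiag(-1, 2, -1)), and the function names `solve_poisson_1d` and `additive_schwarz_apply` are hypothetical.

```python
def solve_poisson_1d(rhs):
    """Solve T y = rhs with T = tridiag(-1, 2, -1) by the Thomas algorithm."""
    m = len(rhs)
    diag = [2.0] * m
    y = list(rhs)
    for i in range(1, m):               # forward elimination
        f = -1.0 / diag[i - 1]
        diag[i] -= f * (-1.0)
        y[i] -= f * y[i - 1]
    y[m - 1] /= diag[m - 1]
    for i in range(m - 2, -1, -1):      # back substitution
        y[i] = (y[i] + y[i + 1]) / diag[i]
    return y

def additive_schwarz_apply(r, domains):
    """Return the sum over domains of (subdomain solve of r restricted to the domain).

    Each domain must be a contiguous index range, so that the corresponding
    diagonal block of the 1D Poisson matrix is again tridiag(-1, 2, -1).
    """
    z = [0.0] * len(r)
    for dom in domains:
        corr = solve_poisson_1d([r[i] for i in dom])
        for i, c in zip(dom, corr):
            z[i] += c                   # corrections add up in the overlap
    return z

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

n = 9
domains = [range(0, 6), range(3, 9)]    # overlap is indices 3, 4, 5
u = [float(i + 1) for i in range(n)]
v = [float((-1) ** i) for i in range(n)]
Mu = additive_schwarz_apply(u, domains)
Mv = additive_schwarz_apply(v, domains)
```

Because the operator is a sum of symmetric positive semidefinite pieces that together cover every unknown, it is symmetric positive definite, which is what conjugate gradients requires of a preconditioner.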




From the discussion in section 6.5, we know that the Gauss-Seidel method 
is likely to be more effective than Jacobi’s method. This is the case here as 
well, with the overlapping block Gauss-Seidel method (more commonly called 
the multiplicative Schwarz method) often being twice as fast as additive block 
Jacobi iteration (the additive Schwarz method). 


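ALGORITHM 6.19. Multiplicative Schwarz method for updating an approximate 
solution $x_i$ of $Ax = b$: 

\[
\begin{array}{lll}
(1) & r_{\Omega_1} = (b - A x_i)_{\Omega_1} & \text{/* compute residual of } x_i \text{ on } \Omega_1 \text{ */} \\
(2) & x_{i+\frac{1}{2},\Omega_1} = x_{i,\Omega_1} + A_{\Omega_1,\Omega_1}^{-1} r_{\Omega_1} & \text{/* update solution on } \Omega_1 \text{ */} \\
(2') & x_{i+\frac{1}{2},\Omega\setminus\Omega_1} = x_{i,\Omega\setminus\Omega_1} & \\
(3) & r_{\Omega_2} = (b - A x_{i+\frac{1}{2}})_{\Omega_2} & \text{/* compute residual of } x_{i+\frac{1}{2}} \text{ on } \Omega_2 \text{ */} \\
(4) & x_{i+1,\Omega_2} = x_{i+\frac{1}{2},\Omega_2} + A_{\Omega_2,\Omega_2}^{-1} r_{\Omega_2} & \text{/* update solution on } \Omega_2 \text{ */} \\
(4') & x_{i+1,\Omega\setminus\Omega_2} = x_{i+\frac{1}{2},\Omega\setminus\Omega_2} &
\end{array}
\]

Note that lines (2') and (4') do not require any data movement, provided that 
$x_{i+\frac{1}{2}}$ and $x_{i+1}$ overwrite $x_i$. 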


This algorithm first solves Poisson’s equation on $\Omega_1$ using boundary data 
from $x_i$, just like Algorithm 6.18. It then solves Poisson’s equation on 
$\Omega_2$, but using boundary data that has just been updated. It may also be 
used as a preconditioner for a Krylov subspace method. 
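The two-stage update can be sketched on a one-dimensional analogue. The code below is an illustration under stated assumptions rather than the L-shaped example: the matrix is tridiag(-1, 2, -1) with 9 unknowns, the two overlapping domains are the index ranges 0 to 5 and 3 to 8, and the names `solve_poisson_1d`, `residual`, and `multiplicative_schwarz_step` are hypothetical. The essential feature is that the residual is recomputed from the freshly updated iterate before the second subdomain solve.

```python
def solve_poisson_1d(rhs):
    """Solve T y = rhs with T = tridiag(-1, 2, -1) by the Thomas algorithm."""
    m = len(rhs)
    diag = [2.0] * m
    y = list(rhs)
    for i in range(1, m):               # forward elimination
        f = -1.0 / diag[i - 1]
        diag[i] -= f * (-1.0)
        y[i] -= f * y[i - 1]
    y[m - 1] /= diag[m - 1]
    for i in range(m - 2, -1, -1):      # back substitution
        y[i] = (y[i] + y[i + 1]) / diag[i]
    return y

def residual(x, b):
    """b - T x for T = tridiag(-1, 2, -1), with zero Dirichlet boundary values."""
    n = len(x)
    out = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        out.append(b[i] - (2.0 * x[i] - left - right))
    return out

def multiplicative_schwarz_step(x, b, domains):
    """One sweep: solve on each domain in turn, always using the newest iterate."""
    x = list(x)
    for dom in domains:
        r = residual(x, b)              # residual of the most recent iterate
        corr = solve_poisson_1d([r[i] for i in dom])
        for i, c in zip(dom, corr):
            x[i] += c
    return x

n = 9
domains = [range(0, 6), range(3, 9)]    # contiguous ranges overlapping in 3, 4, 5
b = [1.0] * n
x = [0.0] * n
for _ in range(30):
    x = multiplicative_schwarz_step(x, b, domains)
final_residual = max(abs(t) for t in residual(x, b))
```

With this generous overlap the error shrinks by a constant factor per sweep, so a few dozen sweeps reach roundoff level; shrinking the overlap to a single point would slow convergence noticeably.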

In practice more domains than just two ($\Omega_1$ and $\Omega_2$) are used. This is 
done if the domain of solution is more complicated or if there are many independent 
parallel processors available to solve independent problems 
$A_{\Omega_i,\Omega_i}^{-1} r_{\Omega_i}$, or just 
to keep the subproblems $A_{\Omega_i,\Omega_i}^{-1} r_{\Omega_i}$ small and 
inexpensive to solve. 

Here is a summary of the theoretical convergence analysis of these methods 
for the model problem and similar elliptic partial differential equations. Let h 
be the mesh spacing. The theory predicts how many iterations are necessary to 
converge as a function of h as h decreases to 0. With two domains, as long as 
the overlap region $\Omega_1 \cap \Omega_2$ is a nonzero fraction of the total 
domain $\Omega_1 \cup \Omega_2$, the 
number of iterations required for convergence is independent of h as h goes to 
zero. This is an attractive property and is reminiscent of multigrid, which also 
converges at a rate independent of mesh size h. But the cost of an iteration 
includes solving subproblems on $\Omega_1$ and $\Omega_2$ exactly, which may be 
comparable in expense to the original problem. So unless the solutions on 
$\Omega_1$ and $\Omega_2$ are 
very cheap (as with the L-shaped region above), the cost is still high. 

Now suppose we have many domains $\Omega_i$, each of size $H \gg h$. In other 
words, think of the $\Omega_i$ as the regions bounded by a coarse mesh with 
spacing H, plus some cells beyond the boundary, as shown by the dashed line in 
Figure 6.21. 

Fig. 6.21. Coarse and fine discretizations of an L-shaped region. 

Let $\delta < H$ be the amount by which adjacent domains overlap. Now let H, 
$\delta$, and h all go to zero such that the overlap fraction $\delta/H$ remains 
constant, and $H \gg h$. Then the number of iterations required for convergence 
grows like 1/H, i.e., independently of the fine mesh spacing h. This is close to, 
but still not as good as, multigrid, which does a constant number of iterations and 
O(1) work per unknown. 

Attaining the performance of multigrid requires one more idea, which, perhaps 
not surprisingly, is similar to multigrid. We use an approximation $A_H$ 
of the problem on the coarse grid with spacing H to get a coarse grid 
preconditioner in addition to the fine grid preconditioners 
$A_{\Omega_i,\Omega_i}^{-1}$. We need three 
matrices to describe the algorithm. First, let $A_H$ be the matrix for the model 
problem discretized with coarse mesh spacing H. Second, we need a restriction 
operator R to take a residual on the fine mesh and restrict it to values on the 
coarse mesh; this is essentially the same as in multigrid (see section 6.9.2). 
Finally, we need an interpolation operator to take values on the coarse mesh 
and interpolate them to the fine mesh; as in multigrid this also turns out to 
be $R^T$. 


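ALGORITHM 6.20. Two-level additive Schwarz method for updating an 
approximate solution $x_i$ of $Ax = b$ to get a better solution $x_{i+1}$: 

\[
\begin{array}{l}
x_{i+1} = x_i \\
\text{for } j = 1 \text{ to the number of domains } \Omega_j \\
\qquad r_{\Omega_j} = (b - A x_i)_{\Omega_j} \\
\qquad x_{i+1,\Omega_j} = x_{i+1,\Omega_j} + A_{\Omega_j,\Omega_j}^{-1} r_{\Omega_j} \\
\text{endfor} \\
x_{i+1} = x_{i+1} + R^T A_H^{-1} R\, r \qquad \text{/* } r = b - A x_i \text{ is the full residual */}
\end{array}
\]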


As with Algorithm 6.18, this method is typically used as a preconditioner 
for a Krylov subspace method. 

Convergence theory for this algorithm, which is applicable to more general 
problems than Poisson’s equation, says that as H, $\delta$, and h shrink to 0 with 
$\delta/H$ staying fixed, the number of iterations required to converge is 
independent of H, h, or $\delta$. This means that as long as the work to solve the 
subproblems $A_{\Omega_i,\Omega_i}$ and $A_H$ is proportional to the number of 
unknowns, the complexity is as good as multigrid. 

It is probably evident to the reader that implementing these methods for a 
real-world problem can be complicated. There is software available on-line that 
implements many of the building blocks described here and also runs on parallel 
machines. It is called PETSc, for Portable, Extensible Toolkit for Scientific 
Computation. PETSc is available at http://www.mcs.anl.gov/petsc/petsc.html 
and is described briefly in [230]. 


6.11. References and Other Topics for Chapter 6 


Up-to-date surveys of modern iterative methods are given in [15, 105, 134, 212], 
and their parallel implementations are also surveyed in [75]. Classical methods 
such as Jacobi’s, Gauss-Seidel, and SOR methods are discussed in detail in 
[247, 135]. Multigrid methods are discussed in [42, 183, 184, 258, 266] and the 
references therein; [89] is a Web site with pointers to an extensive bibliography, 
software, and so on. Domain decomposition methods are discussed in [48, 114, 203, 230]. 
Chebyshev and other polynomials are discussed in [238]. The FFT is discussed 
in any good textbook on computer science algorithms, such as [3] and [246]. 
A stabilized version of block cyclic reduction is found in [46, 45]. 


6.12. Questions for Chapter 6 


QUESTION 6.1. (Easy) Prove Lemma 6.1. 


QUESTION 6.2. (Easy) Prove the following formulas for triangular factoriza- 
tions of Ty. 


1. The Cholesky factorization $T_N = B_N^T B_N$ has an upper bidiagonal Cholesky 
factor $B_N$ with 

\[
B_N(i,i) = \sqrt{\frac{i+1}{i}} \quad \text{and} \quad B_N(i,i+1) = -\sqrt{\frac{i}{i+1}}.
\]

2. The result of Gaussian elimination with partial pivoting on $T_N$ is 
$T_N = L_N U_N$, where the triangular factors are bidiagonal: 

\[
L_N(i,i) = 1 \quad \text{and} \quad L_N(i+1,i) = -\frac{i}{i+1},
\]

\[
U_N(i,i) = \frac{i+1}{i} \quad \text{and} \quad U_N(i,i+1) = -1.
\]

3. $T_N = D_N D_N^T$, where $D_N$ is the N-by-(N + 1) upper bidiagonal matrix 
with 1 on the main diagonal and -1 on the superdiagonal. 


QUESTION 6.3. (Easy) Confirm equation (6.13). 


QUESTION 6.4. (Easy) 


1. Prove Lemma 6.2. 


2. Prove Lemma 6.3. 


3. Prove that the Sylvester equation $AX - XB = C$ is equivalent to 

\[
(I_n \otimes A - B^T \otimes I_m)\,\mathrm{vec}(X) = \mathrm{vec}(C).
\]

4. Prove that $\mathrm{vec}(AXB) = (B^T \otimes A) \cdot \mathrm{vec}(X)$. 


QUESTION 6.5. (Medium) Suppose that $A^{n \times n}$ is diagonalizable, so A has n 
independent eigenvectors: $A x_i = \alpha_i x_i$, or $AX = X\Lambda_A$, where 
$X = [x_1, \ldots, x_n]$ and $\Lambda_A = \mathrm{diag}(\alpha_i)$. Similarly, 
suppose that $B^{m \times m}$ is diagonalizable, so B has m 
independent eigenvectors: $B y_i = \beta_i y_i$, or $BY = Y\Lambda_B$, where 
$Y = [y_1, \ldots, y_m]$ and $\Lambda_B = \mathrm{diag}(\beta_j)$. Prove the 
following results. 

1. The mn eigenvalues of $I_m \otimes A + B \otimes I_n$ are 
$\lambda_{ij} = \alpha_i + \beta_j$, i.e., all possible 
sums of pairs of eigenvalues of A and B. The corresponding eigenvectors 
are $z_{ij}$, where $z_{ij} = x_i \otimes y_j$, whose (km + l)th entry is 
$x_i(k) y_j(l)$. Written another way, 

\[
(I_m \otimes A + B \otimes I_n)(Y \otimes X) = (Y \otimes X) \cdot (I_m \otimes \Lambda_A + \Lambda_B \otimes I_n). \tag{6.63}
\]

2. The Sylvester equation $AX + XB^T = C$ is nonsingular (solvable for X, 
given any C) if and only if the sum $\alpha_i + \beta_j \neq 0$ for all 
eigenvalues $\alpha_i$ of A and $\beta_j$ of B. The same is true for the slightly 
different Sylvester equation $AX + XB = C$ (see also Question 4.6). 

3. The mn eigenvalues of $A \otimes B$ are $\lambda_{ij} = \alpha_i \beta_j$, i.e., 
all possible products 
of pairs of eigenvalues of A and B. The corresponding eigenvectors are 
$z_{ij}$, where $z_{ij} = x_i \otimes y_j$, whose (km + l)th entry is 
$x_i(k) y_j(l)$. Written another way, 

\[
(B \otimes A)(Y \otimes X) = (Y \otimes X) \cdot (\Lambda_B \otimes \Lambda_A). \tag{6.64}
\]


QUESTION 6.6. (Easy; Programming) Write a one-line Matlab program to im- 
plement Algorithm 6.2: one step of Jacobi’s algorithm for Poisson’s equation. 
Test it by confirming that it converges as fast as predicted in section 6.5.4. 


QUESTION 6.7. (Hard) Prove Lemma 6.7. 


QUESTION 6.8. (Medium; Programming) Write a Matlab program to solve the 
discrete model problem on a square using FFTs. The inputs should be the 
dimension N and a square N-by-N matrix of values of $f_{ij}$. The outputs should be 
an N-by-N matrix of solution $v_{ij}$ and the residual 
$\|T_{N \times N} v - h^2 f\|_2 / (\|T_{N \times N}\|_2 \cdot \|v\|_2)$. 
You should also produce three-dimensional plots of f and v. Use the fft 
built in to Matlab. Your program should not have to be more than a few lines 
long if you use all the features of Matlab that you can. Solve it for several 
problems whose solutions you know and several you do not: 


1. $f_{jk} = \sin(j\pi/(N+1)) \cdot \sin(k\pi/(N+1))$. 

2. $f_{jk} = \sin(j\pi/(N+1)) \cdot \sin(k\pi/(N+1)) + \sin(3j\pi/(N+1)) \cdot \sin(5k\pi/(N+1))$. 


3. f has a few sharp spikes (both positive and negative) and is 0 elsewhere. 
This approximates the electrostatic potential of charged particles located 
at the spikes and with charges proportional to the heights (positive or 
negative) of the spikes. If the spikes are all positive, this is also the 
gravitational potential. 


QUESTION 6.9. (Medium) Confirm that evaluating the formula in (6.47) by 
performing the matrix-vector multiplications from right to left is mathemati- 
cally the same as Algorithm 6.13. 


QUESTION 6.10. (Medium; Hard) 


1. (Hard) Let A and H be real symmetric n-by-n matrices that commute, 
i.e., AH = HA. Show that there is an orthogonal matrix Q such that 
$QAQ^T = \mathrm{diag}(\alpha_1, \ldots, \alpha_n)$ and 
$QHQ^T = \mathrm{diag}(\theta_1, \ldots, \theta_n)$ are both diagonal. 
In other words, A and H have the same eigenvectors. Hint: First 
assume A has distinct eigenvalues, and then remove this assumption. 


2. (Medium) Let 

\[
T = \begin{bmatrix}
a & \theta & & \\
\theta & \ddots & \ddots & \\
 & \ddots & \ddots & \theta \\
 & & \theta & a
\end{bmatrix}
\]

be a symmetric tridiagonal Toeplitz matrix, i.e., a symmetric tridiagonal 
matrix with constant $a$ along the diagonal and $\theta$ along the offdiagonals. 
Write down simple formulas for the eigenvalues and eigenvectors of T. 
Hint: Use Lemma 6.1. 


3. (Hard) Let 

\[
T = \begin{bmatrix}
A & H & & \\
H & \ddots & \ddots & \\
 & \ddots & \ddots & H \\
 & & H & A
\end{bmatrix}
\]

be an $n^2$-by-$n^2$ block tridiagonal matrix, with n copies of A along the 
diagonal. Let $QAQ^T = \mathrm{diag}(\alpha_1, \ldots, \alpha_n)$ be the 
eigendecomposition of A, and let $QHQ^T = \mathrm{diag}(\theta_1, \ldots, \theta_n)$ 
be the eigendecomposition of H as above. Write down simple formulas for the 
$n^2$ eigenvalues and eigenvectors of T in terms of the $\alpha_i$, $\theta_i$, 
and Q. Hint: Use Kronecker products. 


4. (Medium) Show how to solve Tx = b in $O(n^3)$ time. In contrast, how 
much bigger are the running times of dense LU factorization and band 
LU factorization? 


5. (Medium) Suppose that A and H are (possibly different) symmetric 
tridiagonal Toeplitz matrices, as defined above. Show how to use the FFT to 
solve Tx = b in just $O(n^2 \log n)$ time. 


QUESTION 6.11. (Easy) Suppose that R is upper triangular and nonsingular 
and that C is upper Hessenberg. Confirm that $RCR^{-1}$ is upper Hessenberg. 


QUESTION 6.12. (Medium) Confirm that the Krylov subspace $\mathcal{K}_k(A, y_1)$ has 
dimension k if and only if the Arnoldi algorithm (Algorithm 6.9) or the Lanczos 
algorithm (Algorithm 6.10) can compute $q_k$ without quitting first. 


QUESTION 6.13. (Medium) Confirm that when $A^{n \times n}$ is symmetric positive 
definite and $Q^{n \times k}$ has full column rank, then $T = Q^T A Q$ is also 
symmetric positive definite. (For this question, Q need not be orthogonal.) 


QUESTION 6.14. (Medium) Prove Theorem 6.9. 


QUESTION 6.15. (Medium; Hard) 
1. (Medium) Confirm equation (6.58). 
2. (Medium) Confirm equation (6.60). 


3. (Hard) Prove Theorem 6.11. 


QUESTION 6.16. (Medium; Programming) A Matlab program implementing 
multigrid to solve the discrete model problem on a square is available on the 
class homepage at HOMEPAGE/Matlab/MG_README.html. Start by running 
the demonstration (type “makemgdemo” and then “testfmgv”). Then, 
try running testfmg for different right-hand sides (input array b), different 
numbers of weighted Jacobi convergence steps before and after each recursive 
call to the multigrid solver (inputs jac1 and jac2), and different numbers of 
iterations (input iter). The software will plot the convergence rate (ratio of 
consecutive residuals); does this depend on the size of b? the frequencies in b? 
the values of jac1 and jac2? For which values of jac1 and jac2 is the solution 
most efficient? 


QUESTION 6.17. (Medium; Programming) Using a fast model problem solver 
from either Question 6.8 or Question 6.16, use domain decomposition to build 
a fast solver for Poisson’s equation on an L-shaped region, as described in 
section 6.10. The large square should be 1-by-1 and the small square should 
be .5-by-.5, attached at the bottom right of the large square. Compute the 
residual in order to show that your answer is correct. 


QUESTION 6.18. (Hard) Fill in the entries of a table like Table 6.1, but for 
solving Poisson’s equation in three dimensions instead of two. Assume that 
the grid of unknowns is $N \times N \times N$, with $n = N^3$. Try to fill in as many entries 
of columns 2 and 3 as you can. 


7 


Iterative Algorithms for Eigenvalue 
Problems 


7.1. Introduction 


In this chapter we discuss methods for finding eigenvalues of matrices that 
are too large to use the “dense” algorithms of Chapters 4 and 5. In other 
words, we seek algorithms that take far less than $O(n^2)$ storage and $O(n^3)$ 
flops. Since the eigenvectors of most n-by-n matrices would take $n^2$ storage to 
represent, this means that we seek algorithms that compute just a few user- 
selected eigenvalues and eigenvectors of a matrix. 

We will depend on the material on Krylov subspace methods developed in 
section 6.6, the material on symmetric eigenvalue problems in section 5.2, and 
the material on the power method and inverse iteration in section 5.3. The 
reader is advised to review these sections. 

The simplest eigenvalue problem is to compute just the largest eigenvalue in 
absolute value, along with its eigenvector. The power method (Algorithm 4.1) 
is the simplest algorithm suitable for this task: Recall that its inner loop is 

\[
\begin{aligned}
y_{i+1} &= A x_i, \\
x_{i+1} &= y_{i+1} / \|y_{i+1}\|_2,
\end{aligned}
\]

where $x_i$ converges to the eigenvector corresponding to the desired eigenvalue 
(provided that there is only one eigenvalue of largest absolute value, and $x_1$ 
is not orthogonal to its eigenvector). Note that the algorithm uses A only 
to perform matrix-vector multiplication, so all that we need to run the 
algorithm is a “black-box” that takes $x_i$ as input and returns $A x_i$ as 
output (see Example 6.13). 
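The inner loop above can be sketched in a few lines. This is an illustrative sketch, not Algorithm 4.1 verbatim: the 3-by-3 test matrix, the iteration count, and the function names `matvec`, `normalize`, and `power_method` are assumptions made for the example. The matrix enters only through a black-box product, as the text emphasizes.

```python
import math

def matvec(A, x):
    """Black-box product A*x for a small dense matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def normalize(y):
    nrm = math.sqrt(sum(t * t for t in y))
    return [t / nrm for t in y]

def power_method(apply_A, x1, iters):
    """Repeat y = A x, x = y / ||y||_2; return the Rayleigh quotient and the vector."""
    x = normalize(x1)
    for _ in range(iters):
        x = normalize(apply_A(x))
    lam = sum(xi * yi for xi, yi in zip(x, apply_A(x)))
    return lam, x

# Symmetric test matrix with eigenvalues 2 - sqrt(2), 2, and 2 + sqrt(2).
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
lam, x = power_method(lambda v: matvec(A, v), [1.0, 0.0, 0.0], iters=100)
```

The error in the eigenvector shrinks by roughly the ratio of the two largest eigenvalue magnitudes per step, which is why 100 steps suffice here but could be far too few for a matrix with clustered eigenvalues.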

A closely related problem is to find the eigenvalue closest to a user-supplied 
value $\sigma$, along with its eigenvector. This is precisely the situation inverse 
iteration (Algorithm 4.2) was designed to handle. Recall that its inner loop is 

\[
\begin{aligned}
y_{i+1} &= (A - \sigma I)^{-1} x_i, \\
x_{i+1} &= y_{i+1} / \|y_{i+1}\|_2,
\end{aligned}
\]

i.e., solving a linear system of equations with coefficient matrix $A - \sigma I$. 
Again $x_i$ converges to the desired eigenvector, provided that there is just one 
eigenvalue closest to $\sigma$ (and $x_1$ is not orthogonal to its eigenvector). Any 
of the sparse matrix techniques in Chapter 6 or section 2.7.4 could be used to 
solve for $y_{i+1}$, although this is usually much more expensive than simply 
multiplying by A. When A is symmetric, Rayleigh quotient iteration 
(Algorithm 5.1) can also be used to accelerate convergence (although it is not 
always guaranteed to converge to the eigenvalue of A closest to $\sigma$). 
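A minimal sketch of inverse iteration follows, under the assumption that the matrix is small and dense so that a plain Gaussian elimination routine can stand in for the sparse solvers mentioned above. The test matrix, the shift 0.5, and the names `solve` and `inverse_iteration` are illustrative choices, not part of Algorithm 4.2.

```python
import math

def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting (small dense M)."""
    n = len(M)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][j] * y[j] for j in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y

def inverse_iteration(A, sigma, x1, iters):
    """Repeat y = (A - sigma I)^{-1} x, x = y / ||y||_2; return Rayleigh quotient and x."""
    n = len(A)
    shifted = [[A[i][j] - (sigma if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    nrm = math.sqrt(sum(t * t for t in x1))
    x = [t / nrm for t in x1]
    for _ in range(iters):
        y = solve(shifted, x)
        nrm = math.sqrt(sum(t * t for t in y))
        x = [t / nrm for t in y]
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return sum(xi * yi for xi, yi in zip(x, Ax)), x

# Symmetric test matrix with eigenvalues 2 - sqrt(2), 2, and 2 + sqrt(2).
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
lam, x = inverse_iteration(A, 0.5, [1.0, 0.0, 0.0], iters=40)
```

In practice one would factor the shifted matrix once and reuse the factorization in every step, rather than re-eliminating as this sketch does.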

Starting with a given $x_1$, k - 1 iterations of either the power method or 
inverse iteration produce a sequence of vectors $x_1, x_2, \ldots, x_k$. These vectors 
span a Krylov subspace, as defined in section 6.6.1. In the case of the power 
method, this Krylov subspace is 
$\mathcal{K}_k(x_1, A) = \mathrm{span}[x_1, Ax_1, A^2 x_1, \ldots, A^{k-1} x_1]$, 
and in the case of inverse iteration this Krylov subspace is 
$\mathcal{K}_k(x_1, (A - \sigma I)^{-1})$. 
Rather than taking $x_k$ as our approximate eigenvector, it is natural to ask 
for the “best” approximate eigenvector in $\mathcal{K}_k$, i.e., the best linear 
combination $\sum_{i=1}^{k} \alpha_i x_i$. We took the same approach for solving 
Ax = b in section 6.6.2, where we asked for the best approximate solution to 
Ax = b from $\mathcal{K}_k$. We will see that the best eigenvector (and eigenvalue) 
approximations from $\mathcal{K}_k$ are much better than $x_k$ alone. Since 
$\mathcal{K}_k$ has dimension k (in general), we can 
actually use it to compute k best approximate eigenvalues and eigenvectors. 
These best approximations are called the Ritz values and Ritz vectors. 

We will concentrate on the symmetric case $A = A^T$. In the last section we 
will briefly describe the nonsymmetric case. 

The rest of this chapter is organized as follows. Section 7.2 discusses the 
Rayleigh-Ritz method, our basic technique for extracting information about 
eigenvalues and eigenvectors from a Krylov subspace. Section 7.3 discusses 
our main algorithm, the Lanczos algorithm, in exact arithmetic. Section 7.4 
analyzes the rather different behavior of the Lanczos algorithm in floating 
point arithmetic, and sections 7.5 and 7.6 describe practical implementations 
of Lanczos that compute reliable answers despite roundoff. Finally, section 7.7 
briefly discusses algorithms for the nonsymmetric eigenproblem. 


7.2. The Rayleigh-Ritz Method 


Let $Q = [Q_k, Q_u]$ be any n-by-n orthogonal matrix, where $Q_k$ is n-by-k and 
$Q_u$ is n-by-(n - k). In practice the columns of $Q_k$ will be computed by the 
Lanczos algorithm (Algorithm 6.10 or Algorithm 7.1 below) and span a Krylov 
subspace $\mathcal{K}_k$, and the subscript u indicates that $Q_u$ is (mostly) 
unknown. But for now we do not care where we get Q. 

We will use the following notation (which was also used in equation (6.31)): 

\[
T = Q^T A Q = [Q_k, Q_u]^T A [Q_k, Q_u]
= \begin{bmatrix} Q_k^T A Q_k & Q_k^T A Q_u \\ Q_u^T A Q_k & Q_u^T A Q_u \end{bmatrix}
\equiv \begin{bmatrix} T_k & T_{ku}^T \\ T_{ku} & T_u \end{bmatrix}. \tag{7.1}
\]

When k = 1, $T_k$ is just the Rayleigh quotient $T_1 = \rho(Q_1, A)$ (see Definition 5.1). 
So for k > 1, $T_k$ is a natural generalization of the Rayleigh quotient. 


DEFINITION 7.1. The Rayleigh-Ritz procedure is to approximate the 
eigenvalues of A by the eigenvalues of $T_k = Q_k^T A Q_k$. These approximations are 
called Ritz values. Let $T_k = V \Lambda V^T$ be the eigendecomposition of $T_k$. The 
corresponding eigenvector approximations are the columns of $Q_k V$ and are called 
Ritz vectors. 


The Ritz values and Ritz vectors are considered optimal approximations 
to the eigenvalues and eigenvectors of A for several reasons. First, when $Q_k$ 
and so $T_k$ are known but $Q_u$ and so $T_{ku}$ and $T_u$ are unknown, the Ritz values 
and vectors are the natural approximations from the known part of the matrix. 
Second, they satisfy the following generalization of Theorem 5.5. (Theorem 5.5 
showed that the Rayleigh quotient was a “best approximation” to a single 
eigenvalue.) Recall that the columns of $Q_k$ span an invariant subspace of A if 
and only if $A Q_k = Q_k R$ for some matrix R. 


THEOREM 7.1. The minimum of $\|A Q_k - Q_k R\|_2$ over all k-by-k symmetric 
matrices R is attained by $R = T_k$, in which case 
$\|A Q_k - Q_k R\|_2 = \|T_{ku}\|_2$. Let 
$T_k = V \Lambda V^T$ be the eigendecomposition of $T_k$. The minimum of 
$\|A P_k - P_k D\|_2$ over all n-by-k orthogonal matrices $P_k$, where 
$\mathrm{span}(P_k) = \mathrm{span}(Q_k)$ and D is 
diagonal, is also $\|T_{ku}\|_2$ and is attained by $P_k = Q_k V$ and $D = \Lambda$. 


In other words, the columns of $Q_k V$ (the Ritz vectors) are the “best” 
approximate eigenvectors and the diagonal entries of $\Lambda$ (the Ritz values) are 
the “best” approximate eigenvalues in the sense of minimizing the residual 
$\|A P_k - P_k D\|_2$. 


Proof. We temporarily drop the subscripts k on $T_k$ and $Q_k$ to simplify 
notation, so we can write the k-by-k matrix $T = Q^T A Q$. Let $R = T + Z$. We 
want to show $\|AQ - QR\|_2^2$ is minimized when Z = 0. We do this by using a 
disguised form of the Pythagorean theorem: 

\[
\begin{aligned}
\|AQ - QR\|_2^2
&= \lambda_{\max}\left[(AQ - QR)^T (AQ - QR)\right] \quad \text{by Part 7 of Lemma 1.7} \\
&= \lambda_{\max}\left[(AQ - Q(T + Z))^T (AQ - Q(T + Z))\right] \\
&= \lambda_{\max}\left[(AQ - QT)^T (AQ - QT) - (AQ - QT)^T (QZ) - (QZ)^T (AQ - QT) + (QZ)^T (QZ)\right] \\
&= \lambda_{\max}\left[(AQ - QT)^T (AQ - QT) - (Q^T A Q - T) Z - Z^T (Q^T A Q - T) + Z^T Z\right] \\
&= \lambda_{\max}\left[(AQ - QT)^T (AQ - QT) + Z^T Z\right] \quad \text{because } Q^T A Q = T \\
&\ge \lambda_{\max}\left[(AQ - QT)^T (AQ - QT)\right] \quad \text{by Question 5.5, since } Z^T Z \text{ is symmetric positive semidefinite} \\
&= \|AQ - QT\|_2^2 \quad \text{by Part 7 of Lemma 1.7.}
\end{aligned}
\]


Restoring subscripts, it is easy to compute the minimum value 

\[
\|A Q_k - Q_k T_k\|_2 = \|(Q_k T_k + Q_u T_{ku}) - (Q_k T_k)\|_2 = \|Q_u T_{ku}\|_2 = \|T_{ku}\|_2.
\]

If we replace $Q_k$ by any product $Q_k U$ where U is another orthogonal matrix, 
then the columns of $Q_k$ and $Q_k U$ span the same space, and 

\[
\|A Q_k - Q_k R\|_2 = \|A Q_k U - Q_k R U\|_2 = \|A (Q_k U) - (Q_k U)(U^T R U)\|_2.
\]

These quantities are still minimized when $R = T_k$, and by choosing U = V 
so that $U^T T_k U$ is diagonal, we solve the second minimization problem in the 
statement of the theorem. 
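The "disguised Pythagorean theorem" in the proof can be checked numerically in the simplest case k = 1, where the matrix of basis vectors is a single unit vector q, the projected matrix is the Rayleigh quotient, and R = rho + z is a scalar. The sketch below is an illustration under those assumptions; the test matrix and the names `rho` and `residual_norm` are hypothetical. It verifies that the squared residual at rho + z equals the squared residual at rho plus z squared, so rho is the minimizer.

```python
import math

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]
q = [1.0 / math.sqrt(3.0)] * 3                   # unit vector playing the role of Q_1
Aq = [sum(a * qj for a, qj in zip(row, q)) for row in A]
rho = sum(qi * yi for qi, yi in zip(q, Aq))      # T_1 = q^T A q, the Rayleigh quotient

def residual_norm(R):
    """|| A q - q R ||_2 for a scalar R (the k = 1 case of || A Q_k - Q_k R ||_2)."""
    return math.sqrt(sum((yi - R * qi) ** 2 for yi, qi in zip(Aq, q)))

best = residual_norm(rho)
shifts = [-1.0, -0.1, 0.1, 1.0]
others = [residual_norm(rho + z) for z in shifts]
# Pythagorean identity: || A q - q (rho + z) ||^2 = || A q - q rho ||^2 + z^2.
gaps = [abs(o ** 2 - (best ** 2 + z ** 2)) for o, z in zip(others, shifts)]
```

The identity holds because the residual at rho is orthogonal to q, which is the k = 1 form of the cross terms vanishing in the proof.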

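This theorem justifies using Ritz values as eigenvalue approximations. When 
$Q_k$ is computed by the Lanczos algorithm, in which case (see equation (6.31)) 

\[
T = \begin{bmatrix} T_k & T_{ku}^T \\ T_{ku} & T_u \end{bmatrix}
= \left[\begin{array}{cccc|cccc}
\alpha_1 & \beta_1 & & & & & & \\
\beta_1 & \ddots & \ddots & & & & & \\
 & \ddots & \ddots & \beta_{k-1} & & & & \\
 & & \beta_{k-1} & \alpha_k & \beta_k & & & \\ \hline
 & & & \beta_k & \alpha_{k+1} & \beta_{k+1} & & \\
 & & & & \beta_{k+1} & \ddots & \ddots & \\
 & & & & & \ddots & \ddots & \beta_{n-1} \\
 & & & & & & \beta_{n-1} & \alpha_n
\end{array}\right],
\]

then it is easy to compute all the quantities in Theorem 7.1. This is because 
there are good algorithms for finding eigenvalues and eigenvectors of the 
symmetric tridiagonal matrix $T_k$ (see section 5.3) and because the residual norm is 
simply $\|T_{ku}\|_2 = \beta_k$. (From the Lanczos algorithm we know that $\beta_k$ is 
nonnegative.) This simplifies the error bounds on the approximate eigenvalues and 
eigenvectors in the following theorem. 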


THEOREM 7.2. Let $T_k$, $T_{ku}$, and $Q_k$ be as in equation (7.1). If $Q_k$ is computed
by the Lanczos algorithm, let $\beta_k$ be the single (possibly) nonzero entry in the
upper right corner of $T_{ku}$. Let $T_k = V \Lambda V^T$ be the eigendecomposition of $T_k$,
where $V = [v_1, \ldots, v_k]$ is orthogonal and $\Lambda = \mathrm{diag}(\theta_1, \ldots, \theta_k)$. Then

1. There are $k$ eigenvalues $\alpha_1, \ldots, \alpha_k$ of $A$ (not necessarily the largest $k$)
such that $|\theta_i - \alpha_i| \le \|T_{ku}\|_2$ for $i = 1, \ldots, k$. If $Q_k$ is computed by the
Lanczos algorithm, then $|\theta_i - \alpha_i| \le \|T_{ku}\|_2 = \beta_k$.


2. $\|A(Q_k v_i) - (Q_k v_i)\theta_i\|_2 = \|T_{ku} v_i\|_2$. Thus, the difference between the Ritz
value $\theta_i$ and some eigenvalue $\alpha$ of $A$ is at most $\|T_{ku} v_i\|_2$, which may be
much smaller than $\|T_{ku}\|_2$. If $Q_k$ is computed by the Lanczos algorithm,
then $\|T_{ku} v_i\|_2 = \beta_k |v_i(k)|$, where $v_i(k)$ is the $k$th (bottom) entry of $v_i$.
This formula lets us compute the residual $\|A(Q_k v_i) - (Q_k v_i)\theta_i\|_2$ cheaply,
i.e., without multiplying any vector by $Q_k$ or by $A$.


3. Without any further information about the spectrum of $T_u$, we cannot
deduce any useful error bound on the Ritz vector $Q_k v_i$. If we know that
the gap between $\theta_i$ and any other eigenvalue of $T_k$ or $T_u$ is at least $g$,
then we can bound the angle $\theta$ between $Q_k v_i$ and a true eigenvector of $A$
by

$$
\frac{1}{2} \sin 2\theta \le \frac{\|T_{ku}\|_2}{g}. \tag{7.2}
$$

If $Q_k$ is computed by the Lanczos algorithm, then the bound simplifies to

$$
\frac{1}{2} \sin 2\theta \le \frac{\beta_k}{g}.
$$
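Proof.
1. The eigenvalues of $\hat{T} = \left[\begin{array}{cc} T_k & 0 \\ 0 & T_u \end{array}\right]$ include $\theta_1$ through $\theta_k$. Since

$$
\|\hat{T} - T\|_2 = \left\| \left[\begin{array}{cc} 0 & T_{ku}^T \\ T_{ku} & 0 \end{array}\right] \right\|_2 = \|T_{ku}\|_2 ,
$$

Weyl’s theorem (Theorem 5.1) tells us that the eigenvalues of $\hat{T}$ and $T$
differ by at most $\|T_{ku}\|_2$. But the eigenvalues of $T$ and $A$ are identical,
proving the result.


2. We compute, with $Q = [Q_k, Q_u]$,

$$
\begin{aligned}
\|A(Q_k v_i) - (Q_k v_i)\theta_i\|_2
 &= \|Q^T A(Q_k v_i) - Q^T (Q_k v_i)\theta_i\|_2 \\
 &= \left\| \left[\begin{array}{c} T_k v_i \\ T_{ku} v_i \end{array}\right]
      - \left[\begin{array}{c} v_i \theta_i \\ 0 \end{array}\right] \right\|_2 \\
 &= \left\| \left[\begin{array}{c} 0 \\ T_{ku} v_i \end{array}\right] \right\|_2
      && \text{since } T_k v_i = \theta_i v_i \\
 &= \|T_{ku} v_i\|_2 .
\end{aligned}
$$

Then by Theorem 5.5, $A$ has some eigenvalue $\alpha$ satisfying $|\alpha - \theta_i| \le
\|T_{ku} v_i\|_2$. If $Q_k$ is computed by the Lanczos algorithm, then $\|T_{ku} v_i\|_2 =
\beta_k |v_i(k)|$, because only the top right entry of $T_{ku}$, namely, $\beta_k$, is nonzero.


3. We reuse Example 5.4 to show that we cannot deduce a useful error bound
on the Ritz vector without further information about the spectrum of $T_u$:

$$
T = \left[\begin{array}{cc} 1 + g & \epsilon \\ \epsilon & 1 \end{array}\right],
$$

where $0 < \epsilon < g$. We let $k = 1$ and $Q_1 = [e_1]$, so $T_1 = 1 + g$ and
the approximate eigenvector is simply $e_1$. But as shown in Example 5.4,
the eigenvectors of $T$ are close to $[1, \epsilon/g]^T$ and $[-\epsilon/g, 1]^T$. So without
a lower bound on $g$, i.e., the gap between the eigenvalue of $T_k$ and all
the other eigenvalues, including those of $T_u$, we cannot bound the error
in the computed eigenvector. If we do have such a lower bound, we can
apply the second bound of Theorem 5.4 to $T$ and $T + E = \mathrm{diag}(T_k, T_u)$
to derive equation (7.2). □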
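The cheap residual formula in part 2 of Theorem 7.2 can be checked numerically. The following is a minimal sketch, not the book's Matlab code: the 4-by-4 diagonal test matrix, the starting vector, and all variable names are assumptions chosen so that $T_2$ is 2-by-2 and its eigenpairs have a closed form. It compares the residual computed with $A$ against $\beta_k |v_i(k)|$ for $k = 2$.

```python
import math

def matvec(x):
    # Hypothetical test matrix: A = diag(1, 2, 3, 4), symmetric by construction.
    return [d * xi for d, xi in zip((1.0, 2.0, 3.0, 4.0), x)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Two Lanczos steps from b = (1, 1, 1, 1) give Q_2 = [q1, q2],
# T_2 = [[a1, b1], [b1, a2]], and beta_2 = b2 (the nonzero entry of T_ku).
b = [1.0, 1.0, 1.0, 1.0]
nb = math.sqrt(dot(b, b))
q1 = [x / nb for x in b]
z = matvec(q1)
a1 = dot(q1, z)
z = [zi - a1 * qi for zi, qi in zip(z, q1)]
b1 = math.sqrt(dot(z, z))
q2 = [zi / b1 for zi in z]
z = matvec(q2)
a2 = dot(q2, z)
z = [zi - a2 * qi - b1 * pi for zi, qi, pi in zip(z, q2, q1)]
b2 = math.sqrt(dot(z, z))

# Eigenpairs of the 2-by-2 matrix T_2 in closed form.
mid, rad = (a1 + a2) / 2, math.hypot((a1 - a2) / 2, b1)
results = []
for theta in (mid + rad, mid - rad):
    v = [b1, theta - a1]                  # solves (T_2 - theta I) v = 0 when b1 != 0
    nv = math.hypot(v[0], v[1])
    v = [v[0] / nv, v[1] / nv]
    y = [v[0] * s + v[1] * t for s, t in zip(q1, q2)]   # Ritz vector Q_2 v
    r = [p - theta * yi for p, yi in zip(matvec(y), y)]
    true_res = math.sqrt(dot(r, r))       # needs a multiplication by A
    cheap_res = b2 * abs(v[1])            # beta_k |v_i(k)|: needs neither A nor Q_k
    results.append((theta, true_res, cheap_res))
```

In this small case the two residuals agree to rounding error, and each Ritz value lies within its residual of an eigenvalue of the test matrix, as Theorem 5.5 requires.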


7.3. The Lanczos Algorithm in Exact Arithmetic 


The Lanczos algorithm for finding eigenvalues of a symmetric matrix $A$ combines
the Lanczos algorithm for building a Krylov subspace (Algorithm 6.10)
with the Rayleigh-Ritz procedure of the last section. In other words, it builds
an orthogonal matrix $Q_k = [q_1, \ldots, q_k]$ of orthogonal Lanczos vectors and
approximates the eigenvalues of $A$ by the Ritz values (the eigenvalues of the
symmetric tridiagonal matrix $T_k = Q_k^T A Q_k$), as in equation (7.1).


ALGORITHM 7.1. Lanczos algorithm in exact arithmetic for finding eigenvalues
and eigenvectors of $A = A^T$:

    $q_1 = b/\|b\|_2$, $\beta_0 = 0$, $q_0 = 0$
    for $j = 1$ to $k$
        $z = A q_j$
        $\alpha_j = q_j^T z$
        $z = z - \alpha_j q_j - \beta_{j-1} q_{j-1}$
        $\beta_j = \|z\|_2$
        if $\beta_j = 0$, quit
        $q_{j+1} = z/\beta_j$
        Compute eigenvalues, eigenvectors, and error bounds of $T_k$
    end for
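The recurrence in Algorithm 7.1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the code from the class homepage: the function name `lanczos`, the `matvec` callback, and the 3-by-3 diagonal test matrix are assumptions, and the step "compute eigenvalues, eigenvectors, and error bounds of $T_k$" is left out.

```python
import math

def lanczos(matvec, b, k):
    """Run up to k steps of the three-term Lanczos recurrence.

    matvec(x) must return A*x for a symmetric A; b is the starting vector.
    Returns (alphas, betas, qs): the diagonal and off-diagonal entries of the
    tridiagonal matrix and the list of Lanczos vectors (q_1 is qs[0]).
    """
    norm_b = math.sqrt(sum(x * x for x in b))
    q = [x / norm_b for x in b]
    q_prev = [0.0] * len(b)          # q_0 = 0
    beta_prev = 0.0                  # beta_0 = 0
    alphas, betas, qs = [], [], [q]
    for _ in range(k):
        z = matvec(q)
        alpha = sum(qi * zi for qi, zi in zip(q, z))
        z = [zi - alpha * qi - beta_prev * pi
             for zi, qi, pi in zip(z, q, q_prev)]
        beta = math.sqrt(sum(zi * zi for zi in z))
        alphas.append(alpha)
        betas.append(beta)
        if beta == 0.0:              # invariant subspace found: quit
            break
        q_prev, q, beta_prev = q, [zi / beta for zi in z], beta
        qs.append(q)
    return alphas, betas, qs

# Hypothetical example: A = diag(1, 2, 3) with starting vector (1, 1, 1).
diag = [1.0, 2.0, 3.0]
alphas, betas, qs = lanczos(lambda x: [d * xi for d, xi in zip(diag, x)],
                            [1.0, 1.0, 1.0], 2)
```

Passing the matrix as a `matvec` callback reflects the fact that the algorithm touches $A$ only through matrix-vector products.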


In this section we explore the convergence of the Lanczos algorithm by
describing a numerical example in some detail. This example has been chosen to
illustrate both typical convergence behavior and some more problematic
behavior, which we call misconvergence. Misconvergence can occur because
the starting vector $q_1$ is nearly orthogonal to the eigenvector of the desired
eigenvalue or when there are multiple (or very close) eigenvalues.




The title of this section indicates that we have (nearly) eliminated the
effects of roundoff error on our example. Of course, the Matlab code
(HOMEPAGE/Matlab/LanczosFullReorthog.m) used to produce the example below
ran in floating point arithmetic, but we implemented Lanczos (in particular
the inner loop of Algorithm 6.10) in a particularly careful and expensive way
in order to make it mimic the exact result as closely as possible. This careful
implementation is called Lanczos with full reorthogonalization, as indicated in
the titles of the figures below.

In the next section we will explore the same numerical example using the 
original, inexpensive implementation of Algorithm 6.10, which we call Lanc- 
zos with no reorthogonalization in order to contrast it with Lanczos with full 
reorthogonalization. (We will also explain the difference in the two implementa- 
tions.) We will see that the original Lanczos algorithm can behave significantly 
differently from the more expensive “exact” algorithm. Nevertheless, we will 
show how to use the less expensive algorithm to compute eigenvalues reliably. 


EXAMPLE 7.1. We illustrate the Lanczos algorithm and its error bounds by
running a large example, a 1000-by-1000 diagonal matrix $A$, most of whose
eigenvalues were chosen randomly from a normal Gaussian distribution.
Figure 7.1 is a plot of the eigenvalues. To make later plots easy to understand,
we have also sorted the diagonal entries of $A$ from largest to smallest, so
$\lambda_i(A) = a_{ii}$, with corresponding eigenvector $e_i$, the $i$th column of the identity
matrix. There are a few extreme eigenvalues, and the rest cluster near the
center of the spectrum. The starting Lanczos vector $q_1$ has all equal entries,
except for one, as described below.

There is no loss in generality in experimenting with a diagonal matrix, since
running Lanczos on $A$ with starting vector $q_1$ is equivalent to running Lanczos
on $Q^T A Q$ with starting vector $Q^T q_1$ (see Question 7.1).

To illustrate convergence, we will use several plots of the sort shown in
Figure 7.2. In this figure the eigenvalues of each $T_k$ are shown plotted in
column $k$, for $k = 1$ to 9 on the top, and for $k = 1$ to 29 on the bottom, with
the eigenvalues of $A$ plotted in an extra column at the right. Thus, column
$k$ has $k$ pluses, one marking each eigenvalue of $T_k$. We have also color-coded
the eigenvalues as follows: The largest and smallest eigenvalues of each $T_k$ are
shown in black, the second largest and second smallest eigenvalues are red, the
third largest and third smallest eigenvalues are green, and the fourth largest
and fourth smallest eigenvalues are blue. Then these colors recycle into the
interior of the spectrum.

To understand convergence, consider the largest eigenvalue of each $T_k$; these
black pluses are on the top of each column. Note that they increase
monotonically as $k$ increases; this is a consequence of the Cauchy interlace theorem,
since $T_k$ is a submatrix of $T_{k+1}$ (see Question 5.4). In fact, the Cauchy
interlace theorem tells us more, that the eigenvalues of $T_k$ interlace those of $T_{k+1}$,
or that $\lambda_i(T_{k+1}) \ge \lambda_i(T_k) \ge \lambda_{i+1}(T_{k+1}) \ge \lambda_{i+1}(T_k)$. In other words, $\lambda_i(T_k)$
increases monotonically with $k$ for any fixed $i$, not just $i = 1$ (the largest
eigenvalue). This is illustrated by the colored sequences of pluses moving right and
up in the figure.

Fig. 7.1. Eigenvalues of the diagonal matrix $A$.

A completely analogous phenomenon occurs with the smallest eigenvalues: 
The bottom black plus sign in each column of Figure 7.2 shows the smallest 
eigenvalue of each Tk, and these are monotonically decreasing as k increases. 
Similarly, the ith smallest eigenvalue is also monotonically decreasing. This is 
also a simple consequence of the Cauchy interlace theorem. 
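The interlacing property can be observed on a small tridiagonal matrix. The sketch below is an illustration under stated assumptions: the helper names `count_below` and `eigenvalue` are hypothetical, the eigenvalues are found by bisection on a Sturm-sequence count (one of several possible methods, not necessarily the one used for the figures), and the 3-by-3 matrix with diagonal entries 2 and off-diagonal entries 1 is made up for the demonstration.

```python
def count_below(alphas, betas, x):
    """Number of eigenvalues smaller than x of the symmetric tridiagonal
    matrix with diagonal alphas and off-diagonal betas (Sturm count)."""
    count, d = 0, 1.0
    for i, a in enumerate(alphas):
        off = betas[i - 1] ** 2 if i > 0 else 0.0
        d = a - x - off / d
        if d == 0.0:
            d = 1e-300               # avoid division by zero in the next step
        if d < 0.0:
            count += 1
    return count

def eigenvalue(alphas, betas, i, tol=1e-13):
    """i-th largest eigenvalue (i = 1 is the largest), by bisection."""
    k = len(alphas)
    r = max(abs(b) for b in betas) if betas else 0.0
    lo = min(alphas) - 2.0 * r - 1.0     # Gershgorin-type enclosing interval
    hi = max(alphas) + 2.0 * r + 1.0
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = (lo + hi) / 2.0
        if count_below(alphas, betas, mid) > k - i:
            hi = mid                     # mid lies above the wanted eigenvalue
        else:
            lo = mid
    return (lo + hi) / 2.0

# Hypothetical T_3; its leading submatrices play the roles of T_1 and T_2.
alphas, betas = [2.0, 2.0, 2.0], [1.0, 1.0]
ritz = {k: [eigenvalue(alphas[:k], betas[:k - 1], i) for i in range(1, k + 1)]
        for k in (1, 2, 3)}
```

Here `ritz[k]` lists the eigenvalues of the leading $k$-by-$k$ submatrix from largest to smallest, so the largest entry grows and the smallest entry shrinks as $k$ increases.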

Now we can ask to which eigenvalue of $A$ the eigenvalue $\lambda_i(T_k)$ can converge
as $k$ increases. Clearly the largest eigenvalue of $T_k$, $\lambda_1(T_k)$, ought to converge
to the largest eigenvalue of $A$, $\lambda_1(A)$. Indeed, if Lanczos proceeds to step $k = n$
(without quitting early because some $\beta_k = 0$), then $T_n$ and $A$ are similar, and
so $\lambda_1(T_n) = \lambda_1(A)$. Similarly, the $i$th largest eigenvalue $\lambda_i(T_k)$ of $T_k$ must
increase monotonically and converge to the $i$th largest eigenvalue $\lambda_i(A)$ of $A$
(provided that Lanczos does not quit early). And the $i$th smallest eigenvalue
$\lambda_{k+1-i}(T_k)$ of $T_k$ must similarly decrease monotonically and converge to the
$i$th smallest eigenvalue $\lambda_{n+1-i}(A)$ of $A$.

All these converging sequences are represented by sequences of pluses of a
common color in Figure 7.2 and other figures in this section. Consider the right
graph in Figure 7.2: For $k$ larger than about 15, the topmost and bottom-most
black pluses form horizontal rows next to the extreme eigenvalues of $A$, which
are plotted in the rightmost column; this demonstrates convergence. Similarly,
the outermost sequences of red pluses form horizontal rows next to the second
largest and second smallest eigenvalues of $A$ in the rightmost column; they
converge later than the outermost eigenvalues. A blow-up of this behavior for
more Lanczos steps is shown in the top two graphs of Figure 7.3.

[Figure: two panels, “9 steps of Lanczos (full reorthogonalization) applied to A”
and “29 steps of Lanczos (full reorthogonalization) applied to A”; horizontal
axis “Lanczos step”.]

Fig. 7.2. The Lanczos algorithm applied to $A$. The first 9 steps are shown on the
top, and the first 29 steps are shown on the bottom. Column $k$ shows the eigenvalues
of $T_k$, except that the rightmost columns (column 10 on the left and column 30 on the
right) show all the eigenvalues of $A$.

To summarize the above discussion, extreme eigenvalues, i.e., the largest
and smallest ones, converge first, and the interior eigenvalues converge last.
Furthermore, convergence is monotonic, with the $i$th largest (smallest)
eigenvalue of $T_k$ increasing (decreasing) to the $i$th largest (smallest) eigenvalue of
$A$, provided that Lanczos does not stop prematurely with some $\beta_k = 0$.

Now we examine the convergence behavior in more detail, compute the
actual errors in the Ritz values, and compare these errors with the error bounds
in part 2 of Theorem 7.2. We run Lanczos for 99 steps on the same matrix
pictured in Figure 7.2 and display the results in Figure 7.3. The top left graph
in Figure 7.3 shows only the largest eigenvalues, and the top right graph shows
only the smallest eigenvalues.

The middle two graphs in Figure 7.3 show the errors in the four largest 
computed eigenvalues (on the left) and the four smallest computed eigenvalues 
(on the right). The colors in the middle graphs match the colors in the top 
graphs. We measure and plot the errors in three ways: 


• The global errors (the solid lines) are given by $|\lambda_i(T_k) - \lambda_i(A)|/|\lambda_i(A)|$.
We divide by $|\lambda_i(A)|$ in order to normalize all the errors to lie between 1
(no accuracy) and about $10^{-16}$ (machine epsilon, or full accuracy). As $k$
increases, the global error decreases monotonically, and we expect it to
decrease to machine epsilon, unless Lanczos quits prematurely.


• The local errors (the dotted lines) are given by
$\min_j |\lambda_i(T_k) - \lambda_j(A)|/|\lambda_i(A)|$. The local error measures the smallest
distance between $\lambda_i(T_k)$ and the nearest eigenvalue $\lambda_j(A)$ of $A$, not just the
ultimate value $\lambda_i(A)$. We plot this because sometimes the local error is
much smaller than the global error.


• The error bounds (the dashed lines) are the quantities
$|\beta_k v_i(k)|/|\lambda_i(A)|$ computed by the algorithm (except for the
normalization by $|\lambda_i(A)|$, which of course the algorithm does not know!).
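The three error measures just listed can be written out as a short function. This is a hedged sketch: the function name `error_measures`, its argument names, and the numbers in the example are all made up for illustration and are not data from the figures.

```python
def error_measures(ritz_value, i, eigs_of_A, beta_k, v_i_last):
    """Return (global error, local error, error bound) for one Ritz value.

    eigs_of_A is sorted from largest to smallest and i is a 0-based index;
    beta_k and v_i_last stand for beta_k and v_i(k) in part 2 of Theorem 7.2.
    """
    scale = abs(eigs_of_A[i])                       # normalization |lambda_i(A)|
    global_err = abs(ritz_value - eigs_of_A[i]) / scale
    local_err = min(abs(ritz_value - lam) for lam in eigs_of_A) / scale
    bound = abs(beta_k * v_i_last) / scale
    return global_err, local_err, bound

# Made-up numbers: a Ritz value passing close to the second eigenvalue
# on its way to the largest one.
eigs_of_A = [2.81, 2.7, 2.6]
glob, loc, bnd = error_measures(2.69, 0, eigs_of_A, beta_k=0.5, v_i_last=0.1)
```

In this example the local error is much smaller than the global error, which is the situation the dotted lines are meant to expose.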


The bottom two graphs in Figure 7.3 show the eigenvector components
of the Lanczos vectors $q_k$ for the four eigenvectors corresponding to the four
largest eigenvalues (on the left) and for the four eigenvectors corresponding
to the four smallest eigenvalues (on the right). In other words, they plot
$q_k^T e_j = q_k(j)$, where $e_j$ is the $j$th eigenvector of the diagonal matrix $A$, for
$k = 1$ to 99 and for $j = 1$ to 4 (on the left) and $j = 997$ to 1000 (on the
right). The components are plotted on a logarithmic scale, with “+” and “o”
to indicate whether the component is positive or negative, respectively. We
use these plots to help explain convergence below.

Now we use Figure 7.3 to examine convergence in more detail. The largest
eigenvalue of $T_k$ (topmost black pluses in the top left graph of Figure 7.3)
begins converging to its final value (about 2.81) right away, is correct to six
decimal places after 25 Lanczos steps, and is correct to machine precision by
step 50. The global error is shown by the solid black line in the middle left
graph. The local error (the dotted black line) is the same as the global error
after not too many steps, although it can be “accidentally” much smaller if
an eigenvalue $\lambda_i(T_k)$ happens to fall close to some other $\lambda_j(A)$ on its way to
$\lambda_i(A)$. The dashed black line in the same graph is the relative error bound
computed by the algorithm, which overestimates the true error up to about
step 75. Still, the relative error bound correctly indicates that the largest
eigenvalue is correct to several decimal digits.

The second through fourth largest eigenvalues (the topmost red, green and
blue pluses in the top left graph of Figure 7.3) converge in a similar fashion,
with eigenvalue $i$ converging slightly faster than eigenvalue $i+1$. This is typical
behavior of the Lanczos algorithm.

[Figure: six panels titled “99 steps of Lanczos (full reorthogonalization)
applied to A”.]

Fig. 7.3. 99 steps of Lanczos applied to $A$. The largest eigenvalues are shown
on the left, and the smallest on the right. The top two graphs show the eigenvalues
themselves, the middle two graphs the errors (global = solid, local = dotted, bounds
= dashed), and the bottom two graphs show eigencomponents of Lanczos vectors. The
colors in a column of three graphs match.

The bottom left graph of Figure 7.3 measures convergence in terms of
the eigenvector components $q_k^T e_j$. To explain this graph, consider what
happens to the Lanczos vectors $q_k$ as the first eigenvalue converges. Convergence
means that the corresponding eigenvector $e_1$ nearly lies in the Krylov subspace
spanned by the Lanczos vectors. In particular, since the first eigenvalue has
converged after $k = 50$ Lanczos steps, this means that $e_1$ must very nearly be
a linear combination of $q_1$ through $q_{50}$. Since the $q_k$ are mutually orthogonal,
this means $q_k$ must also be orthogonal to $e_1$ for $k > 50$. This is borne out by
the black curve in the bottom left graph, which has decreased to less than $10^{-7}$
by step 50. The red curve is the component of $e_2$ in $q_k$, and this reaches $10^{-8}$
by step 60. The green curve (third eigencomponent) and blue curve (fourth
eigencomponent) get comparably small a few steps later.

Now we discuss the smallest four eigenvalues, whose behavior is described
by the three graphs on the right of Figure 7.3. We have chosen the matrix
$A$ and starting vector $q_1$ to illustrate certain difficulties that can arise in the
convergence of the Lanczos algorithm and to show that convergence is not always
as straightforward as in the case of the four eigenvalues just examined.

In particular, we have chosen $q_1(999)$, the eigencomponent of $q_1$ in the
direction of the second smallest eigenvalue ($-2.81$), to be about $10^{-7}$, which is
$10^5$ times smaller than all the other components of $q_1$, which are equal. Also,
we have chosen the third and fourth smallest eigenvalues (numbers 998 and
997) to be nearly the same: $-2.700001$ and $-2.7$.

The convergence of the smallest eigenvalue of $T_k$ to $\lambda_{1000}(A) \approx -3.03$ is
uneventful, similar to the largest eigenvalues. It is correct to 16 digits by step
40.

The second smallest eigenvalue of $T_k$, shown in red, begins by misconverging
to the third smallest eigenvalue of $A$, near $-2.7$. Indeed, the dotted red line in
the middle right graph of Figure 7.3 shows that $\lambda_{999}(T_k)$ agrees with $\lambda_{998}(A)$
to six decimal places for Lanczos steps $40 < k < 50$. The corresponding error
bound (the red dashed line) tells us that $\lambda_{999}(T_k)$ equals some eigenvalue of $A$
to three or four decimal places for the same values of $k$. The reason $\lambda_{999}(T_k)$
misconverges is that the Krylov subspace starts with a very small component of
the corresponding eigenvector $e_{999}$, namely, $10^{-7}$. This can be seen by the
red curve in the bottom right graph, which starts at $10^{-7}$ and takes until step 45
before a large component of $e_{999}$ appears. Only at this point, when the Krylov
subspace contains a sufficiently large component of the eigenvector $e_{999}$, can
$\lambda_{999}(T_k)$ start converging again to its final value $\lambda_{999}(A) \approx -2.81$, as shown
in the top and middle right graphs. Once this convergence has set in again,
the component of $e_{999}$ starts decreasing again and becomes very small once
$\lambda_{999}(T_k)$ has converged to $\lambda_{999}(A)$ sufficiently accurately. (For a quantitative
relationship between the convergence rate and the eigencomponent $q_1^T e_{999}$, see
the theorem of Kaniel and Saad discussed below.)

Fig. 7.4. Lanczos applied to $A$, where the starting vector $q_1$ is orthogonal to the
eigenvector corresponding to the second smallest eigenvalue $-2.81$. No approximation
to this eigenvalue is computed.


Indeed, if $q_1$ were exactly orthogonal to $e_{999}$, so $q_1^T e_{999} = 0$ rather than
just $q_1^T e_{999} = 10^{-7}$, then all later Lanczos vectors would also be orthogonal to
$e_{999}$. This means $\lambda_{999}(T_k)$ would never converge to $\lambda_{999}(A)$. (For a proof, see
Question 7.3.) We illustrate this in Figure 7.4, where we have modified $q_1$ just
slightly so that $q_1^T e_{999} = 0$. Note that no approximation to $\lambda_{999}(A) \approx -2.81$
ever appears.
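This effect is easy to reproduce on a small diagonal matrix. The following is a sketch under assumptions: the 6-by-6 spectrum, the starting vector, and the helper name `lanczos_vectors` are hypothetical and much smaller than the 1000-by-1000 example in the text. Because the matrix is diagonal, a zero entry in the starting vector stays zero in every later Lanczos vector.

```python
import math

def lanczos_vectors(diag, b, k):
    """Three-term Lanczos recurrence for A = diag(diag).
    Returns the Lanczos vectors q_1, q_2, ... that were computed."""
    nb = math.sqrt(sum(x * x for x in b))
    q, q_prev, beta_prev = [x / nb for x in b], [0.0] * len(b), 0.0
    qs = [q]
    for _ in range(k):
        z = [d * x for d, x in zip(diag, q)]            # z = A q_j
        alpha = sum(x * y for x, y in zip(q, z))
        z = [zi - alpha * qi - beta_prev * pi
             for zi, qi, pi in zip(z, q, q_prev)]
        beta = math.sqrt(sum(x * x for x in z))
        if beta == 0.0:
            break
        q_prev, q, beta_prev = q, [x / beta for x in z], beta
        qs.append(q)
    return qs

diag = [3.0, 2.0, 1.0, 0.0, -1.0, -2.0]   # hypothetical small spectrum
b = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0]        # zero component along e_5
qs = lanczos_vectors(diag, b, 4)
components = [q[4] for q in qs]           # e_5 component of each Lanczos vector
```

Every entry of `components` is zero, so the Krylov subspace never sees the eigenvector $e_5$, and no Ritz value can converge to the eigenvalue $-1$ of this test matrix.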

Fortunately, if we choose $q_1$ at random, it is extremely unlikely to be
orthogonal to an eigenvector. We can always rerun Lanczos with a different
random $q_1$ to provide more “statistical” evidence that we have not missed any
eigenvalues.

Another source of “misconvergence” is (nearly) multiple eigenvalues, such
as the third smallest eigenvalue $\lambda_{998}(A) = -2.700001$ and the fourth
smallest eigenvalue $\lambda_{997}(A) = -2.7$. By examining $\lambda_{998}(T_k)$, the
bottommost green curve in the top right and middle right graphs of Figure 7.3, we
see that during Lanczos steps $50 < k < 75$, $\lambda_{998}(T_k)$ misconverges to about
$-2.7000005$, halfway between the two closest eigenvalues of $A$. This is not
visible at the resolution provided by the top right graph but is evident from
the horizontal segment of the solid green line in the middle right graph during
Lanczos steps $50 < k < 75$. At step 76 rapid convergence to the final value
$\lambda_{998}(A) = -2.700001$ sets in again.

Meanwhile, the fourth smallest eigenvalue $\lambda_{997}(T_k)$, shown in blue, has
misconverged to a value near $\lambda_{996}(A) \approx -2.64$; the blue dotted line in the
middle right graph indicates that $\lambda_{997}(T_k)$ and $\lambda_{996}(A)$ agree to up to nine
decimal places near step $k = 61$. At step $k = 65$ rapid convergence sets in
again to the final value $\lambda_{997}(A) = -2.7$. This can also be seen in the bottom
right graph, where the eigenvector components of $e_{997}$ and $e_{998}$ grow again
during steps $50 < k < 65$, after which rapid convergence sets in and they again
decrease.

Fig. 7.5. Lanczos applied to $A$, where the third and fourth smallest eigenvalues are
equal. Only one approximation to this double eigenvalue is computed.

Indeed, if $\lambda_{997}(A)$ were exactly a double eigenvalue, we claim that $T_k$ would
never have two eigenvalues near that value but only one (in exact arithmetic).
(For a proof, see Question 7.3.) We illustrate this in Figure 7.5, where we have
modified $A$ just slightly so that it has two eigenvalues exactly equal to $-2.7$.
Note that only one approximation to $\lambda_{998}(A) = \lambda_{997}(A) = -2.7$ ever appears.

Fortunately, there are many applications where it is sufficient to find one
copy of each eigenvalue rather than all multiple copies. Also, it is possible to
use “block Lanczos” to recover multiple eigenvalues (see the algorithms cited
in section 7.6).

Examining other eigenvalues in the top right graph of Figure 7.3, we see
that misconvergence is quite common, as indicated by the frequent short
horizontal segments of like-colored pluses, which then drop off to the right to the
next smaller eigenvalue. For example, the seventh smallest eigenvalue is
well-approximated by the fifth (black), sixth (red), and seventh (green) smallest
eigenvalues of $T_k$ at various Lanczos steps.

These misconvergence phenomena explain why the computable error bound
provided by part 2 of Theorem 7.2 is essential to monitor convergence. If the
error bound is small, the computed eigenvalue is indeed a good approximation
to some eigenvalue, even if one is “missing.” ◇


There is another error bound, due to Kaniel and Saad, that sheds light on
why misconvergence occurs. This error bound depends on the angle between
the starting vector $q_1$ and the desired eigenvectors, the Ritz values, and the
desired eigenvalues. In other words, it depends on quantities unknown during
the computation, so it is not of practical use. But it shows that if $q_1$ is nearly
orthogonal to the desired eigenvector, or if the desired eigenvalue is nearly
multiple, then we can expect slow convergence. See [195, sect. 12-4] for details.


7.4. The Lanczos Algorithm in Floating Point Arithmetic


The example in the last section described the behavior of the “ideal” Lanczos
algorithm, essentially without roundoff. We call the corresponding careful but
expensive implementation of Algorithm 6.10 Lanczos with full reorthogonalization
to contrast it with the original inexpensive implementation, which we call
Lanczos with no reorthogonalization (HOMEPAGE/Matlab/LanczosNoReorthog.m).
Both algorithms are shown below.


ALGORITHM 7.2. Lanczos algorithm with full or no reorthogonalization for
finding eigenvalues and eigenvectors of $A = A^T$:

    $q_1 = b/\|b\|_2$, $\beta_0 = 0$, $q_0 = 0$
    for $j = 1$ to $k$
        $z = A q_j$
        $\alpha_j = q_j^T z$
        either
            $z = z - \sum_{i=1}^{j-1} (z^T q_i) q_i$, $z = z - \sum_{i=1}^{j-1} (z^T q_i) q_i$    (full reorthogonalization)
        or
            $z = z - \alpha_j q_j - \beta_{j-1} q_{j-1}$    (no reorthogonalization)
        $\beta_j = \|z\|_2$
        if $\beta_j = 0$, quit
        $q_{j+1} = z/\beta_j$
        Compute eigenvalues, eigenvectors, and error bounds of $T_k$
    end for


Full reorthogonalization corresponds to applying the Gram-Schmidt
orthogonalization process “$z = z - \sum_{i=1}^{j-1} (z^T q_i) q_i$” twice in order to almost surely
make $z$ orthogonal to $q_1$ through $q_{j-1}$. (See Algorithm 3.1 as well as [195, sect.
6-9] and [169, chap. 7] for discussions of when “twice is enough.”) In exact
arithmetic, we showed in section 6.6.1 that $z$ is orthogonal to $q_1$ through $q_{j-1}$
without reorthogonalization. Unfortunately, we will see that roundoff destroys
this orthogonality property, upon which all of our analysis has depended so
far.
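The two variants can be compared with a short experiment. The sketch below rests on several assumptions: it uses a small made-up diagonal matrix rather than the 1000-by-1000 example, the names `lanczos` and `max_offdiag` are hypothetical, and in the full variant $z$ is orthogonalized against all of $q_1, \ldots, q_j$ (which subsumes the three-term recurrence). Both variants store every Lanczos vector here only so that orthogonality can be measured afterward.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def lanczos(diag, b, k, full_reorthog):
    """Lanczos on A = diag(diag); returns (alphas, betas, qs)."""
    nb = math.sqrt(dot(b, b))
    qs = [[x / nb for x in b]]
    alphas, betas, beta_prev = [], [], 0.0
    for j in range(k):
        q = qs[j]
        z = [d * x for d, x in zip(diag, q)]            # z = A q_j
        alpha = dot(q, z)
        if full_reorthog:
            for _ in range(2):                          # Gram-Schmidt, applied twice
                coeffs = [dot(z, qi) for qi in qs]
                for c, qi in zip(coeffs, qs):
                    z = [zi - c * x for zi, x in zip(z, qi)]
        else:
            q_prev = qs[j - 1] if j > 0 else [0.0] * len(b)
            z = [zi - alpha * x - beta_prev * p
                 for zi, x, p in zip(z, q, q_prev)]
        beta = math.sqrt(dot(z, z))
        alphas.append(alpha)
        betas.append(beta)
        if beta == 0.0:
            break
        qs.append([zi / beta for zi in z])
        beta_prev = beta
    return alphas, betas, qs

def max_offdiag(qs):
    """Largest |q_i^T q_j| over i != j; zero for exactly orthogonal vectors."""
    return max(abs(dot(qs[i], qs[j])) for i in range(len(qs)) for j in range(i))

diag = [float(i) for i in range(1, 41)]                 # hypothetical 40-by-40 A
ones = [1.0] * len(diag)
_, _, qs_full = lanczos(diag, ones, 20, full_reorthog=True)
_, _, qs_none = lanczos(diag, ones, 20, full_reorthog=False)
loss_full, loss_none = max_offdiag(qs_full), max_offdiag(qs_none)
```

Comparing `loss_full` with `loss_none` for larger step counts gives a rough picture of how much orthogonality the inexpensive variant gives up; the extra inner loop in the full variant is the source of its higher cost.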

This loss of orthogonality does not cause the algorithm to behave
completely unpredictably. Indeed, we will see that the price we pay is to get
multiple copies of converged Ritz values. In other words, instead of $T_k$ having
one eigenvalue nearly equal to $\lambda_i(A)$ for $k$ large, it may have many eigenvalues
nearly equal to $\lambda_i(A)$. This is not a disaster if one is not concerned about
computing multiplicities of eigenvalues and does not mind the resulting
delayed convergence of interior eigenvalues. See [56] for a detailed description of
a Lanczos implementation that operates in this fashion, and NETLIB/lanczos
for the software itself.

But if accurate multiplicities are important, then one needs to keep the
Lanczos vectors (nearly) orthogonal. So one could use the Lanczos algorithm
with full reorthogonalization, as we did in the last section. But one can easily
confirm that this costs $O(k^2 n)$ flops instead of $O(kn)$ flops for $k$ steps, and
$O(kn)$ space instead of $O(n)$ space, which may be too high a price to pay.

Fortunately, there is a middle ground between no reorthogonalization and
full reorthogonalization, which nearly gets the best of both worlds. It turns
out that the $q_k$ lose their orthogonality in a very systematic way by developing
large components in the directions of already converged Ritz vectors. (This is
what leads to multiple copies of converged Ritz values.) This systematic loss
of orthogonality is illustrated by the next example and explained by Paige’s
theorem below. We will see that by monitoring the computed error bounds, we
can conservatively predict which $q_k$ will have large components of which Ritz
vectors. Then we can selectively orthogonalize $q_k$ against just those few prior
Ritz vectors, rather than against all the earlier $q_i$s at each step, as with full
reorthogonalization. This keeps the Lanczos vectors (nearly) orthogonal for
very little extra work. The next section discusses selective orthogonalization
in detail.


EXAMPLE 7.2. Figure 7.7 shows the convergence behavior of 149 steps of
Lanczos on the matrix in Example 7.1. The graphs on the right are with full
reorthogonalization, and the graphs on the left are with no reorthogonalization.
These graphs are similar to those in Figure 7.3, except that the global error is
omitted, since this clutters the middle graphs.

Figure 7.6 plots the smallest singular value $\sigma_{\min}(Q_k)$ versus Lanczos step
$k$. In exact arithmetic, $Q_k$ is orthogonal and so $\sigma_{\min}(Q_k) = 1$. With roundoff,
$Q_k$ loses orthogonality starting at around step $k = 70$, and $\sigma_{\min}(Q_k)$ drops to
0.01 by step $k = 80$, which is where the top two graphs in Figure 7.7 begin to
diverge visually.

In particular, starting at step $k = 80$ in the top left graph of Figure 7.7, the
second largest (red) eigenvalue $\lambda_2(T_k)$, which had converged to $\lambda_2(A) \approx 2.7$
to almost 16 digits, leaps up to $\lambda_1(A) \approx 2.81$ in just a few steps, yielding a
“second copy” of $\lambda_1(A)$ along with $\lambda_1(T_k)$ (in black). (This may be hard to see,
since the red pluses overwrite and so obscure the black pluses.) This transition
can be seen in the leap in the dashed red error bound in the middle left graph.
Also, this transition was “foreshadowed” by the increasing component of $e_1$
in the bottom left graph, where the black curve starts rising again at step
$k = 50$ rather than continuing to decrease to machine epsilon, as it does with
full reorthogonalization in the bottom right graph. Both of these indicate
that the algorithm is diverging from its exact path (and that some selective
orthogonalization is called for). After the second copy of $\lambda_1(A)$ has converged,
the component of $e_1$ in the Lanczos vectors starts dropping again, starting a
little after step $k = 80$.

Fig. 7.6. Lanczos algorithm without reorthogonalization applied to $A$. The smallest
singular value $\sigma_{\min}(Q_k)$ of the Lanczos vector matrix $Q_k$ is shown for $k = 1$ to 149.
In the absence of roundoff, $Q_k$ is orthogonal, and so all singular values should be one.
With roundoff, $Q_k$ becomes rank deficient.

Similarly, starting at about step $k = 95$, a second copy of $\lambda_2(A)$ appears
when the blue curve ($\lambda_4(T_k)$) in the upper left graph moves from about $\lambda_3(A) \approx
2.6$ to $\lambda_2(A) \approx 2.7$. At this point we have two copies of $\lambda_1(A) \approx 2.81$ and two
copies of $\lambda_2(A)$. This is a bit hard to see on the graphs, since the pluses
of one color obscure the pluses of the other color (red overwrites black, and
blue overwrites green). This transition is indicated by the dashed blue error
bound for $\lambda_4(T_k)$ in the middle left graph rising sharply near $k = 95$ and is
foreshadowed by the rising red curve in the bottom left graph, which indicates
that the component of $e_2$ in the Lanczos vectors is rising. This component
peaks near $k = 95$ and starts dropping again.

Finally, around step $k = 145$, a third copy of $\lambda_1(A)$ appears, again indicated
and foreshadowed by changes in the two bottom left graphs. If we were to
continue the Lanczos process, we would periodically get additional copies of
many other converged Ritz values. ◇


The next theorem provides an explanation for the behavior seen in the
above example, and hints at a practical criterion for selectively orthogonalizing
Lanczos vectors. In order not to be overwhelmed by taking all possible roundoff
errors into account, we will draw on others’ experience to identify those few
rounding errors that are important, and simply ignore the rest [195, sect. 13-4].
This lets us summarize the Lanczos algorithm with no reorthogonalization
in one line:

$$
\beta_j q_{j+1} + f_j = A q_j - \alpha_j q_j - \beta_{j-1} q_{j-1}. \tag{7.3}
$$

[Figure: panels titled “149 steps of Lanczos (full reorthogonalization) applied
to A”; horizontal axis “Lanczos step”.]

Fig. 7.7. 149 steps of Lanczos applied to $A$. Column 150 (at the right of the top
graphs) shows the eigenvalues of $A$. In the left graphs, no reorthogonalization is done.
In the right graphs, full reorthogonalization is done.

In this equation the variables represent the values actually stored in the
machine, except for $f_j$, which represents the roundoff error incurred by
evaluating the right-hand side and then computing $\beta_j$ and $q_{j+1}$. The norm $\|f_j\|_2$ is
bounded by $O(\varepsilon \|A\|)$, where $\varepsilon$ is machine epsilon, which is all we need to know
about $f_j$. In addition, we will write $T_k = V \Lambda V^T$ exactly, since we know that
the roundoff errors occurring in this eigendecomposition are not important.
Thus, $Q_k$ is not necessarily an orthogonal matrix, but $V$ is.


THEOREM 7.3. Paige. We use the notation and assumptions of the last
paragraph. We also let $Q_k = [q_1, \ldots, q_k]$, $V = [v_1, \ldots, v_k]$, and $\Lambda = \mathrm{diag}(\theta_1, \ldots, \theta_k)$.
We continue to call the columns $y_{k,i} = Q_k v_i$ of $Q_k V$ the Ritz vectors and the
$\theta_i$ the Ritz values. Then

$$
y_{k,i}^T q_{k+1} = \frac{O(\varepsilon \|A\|)}{\beta_k |v_i(k)|} .
$$

In other words the component $y_{k,i}^T q_{k+1}$ of the computed Lanczos vector
$q_{k+1}$ in the direction of the Ritz vector $y_{k,i} = Q_k v_i$ is proportional to the
reciprocal of $\beta_k |v_i(k)|$, which is the error bound on the corresponding Ritz
value $\theta_i$ (see Part 2 of Theorem 7.2). Thus, when the Ritz value $\theta_i$ converges
and its error bound $\beta_k |v_i(k)|$ goes to zero, the Lanczos vector $q_{k+1}$ acquires a
large component in the direction of Ritz vector $y_{k,i}$. Thus, the Ritz vectors
become linearly dependent, as seen in Example 7.2. Indeed, Figure 7.8 plots
both the error bound $|\beta_k v_i(k)|/|\lambda_i(A)| \approx |\beta_k v_i(k)|/\|A\|$ and the Ritz vector
component $y_{k,i}^T q_{k+1}$ for the largest Ritz value ($i = 1$, the top graph) and for
the second largest Ritz value ($i = 2$, the bottom graph) of our 1000-by-1000
diagonal example. According to Paige’s theorem, the product of these two
quantities should be $O(\varepsilon)$. Indeed it is, as can be seen by the symmetry of the
curves about the middle line $\sqrt{\varepsilon}$ of these semilogarithmic graphs.

Proof of Paige’s theorem. We start with equation (7.3) for $j = 1$ to $j = k$, and
write these $k$ equations as the single equation

$$
\begin{aligned}
A Q_k &= Q_k T_k + [0, \ldots, 0, \beta_k q_{k+1}] + F_k \\
      &= Q_k T_k + \beta_k q_{k+1} e_k^T + F_k ,
\end{aligned}
$$

where $e_k^T$ is the $k$-dimensional row vector $[0, \ldots, 0, 1]$ and $F_k = [f_1, \ldots, f_k]$ is
the matrix of roundoff errors. We simplify notation by dropping the subscript
$k$ to get $AQ = QT + \beta q e^T + F$. Multiply on the left by $Q^T$ to get $Q^T A Q =
Q^T Q T + \beta Q^T q e^T + Q^T F$. Since $Q^T A Q$ is symmetric, we get that $Q^T Q T +
\beta Q^T q e^T + Q^T F$ equals its transpose or, rearranging this equality,

$$
0 = (Q^T Q T - T Q^T Q) + \beta (Q^T q e^T - e q^T Q) + (Q^T F - F^T Q). \tag{7.4}
$$

If $\theta$ and $v$ are a Ritz value and Ritz vector, respectively, so that $Tv = \theta v$, then
note that

$$
v^T \beta (e q^T Q) v = [\beta v(k)] \cdot [q^T (Qv)] \tag{7.5}
$$

is the product of error bound $\beta v(k)$ and the Ritz vector component $q^T (Qv) =
q^T y$, which Paige’s theorem says should be $O(\varepsilon \|A\|)$. Our goal is now to
manipulate equation (7.4) to get an expression for $e q^T Q$ alone, and then use
equation (7.5).

To this end, we now invoke more simplifying assumptions about roundoff:
Since each column of $Q$ is gotten by dividing a vector $z$ by its norm, the
diagonal of $Q^T Q$ is equal to 1 to full machine precision; we will suppose
that it is exactly 1. Furthermore, the vector
$z' = z - \alpha_j q_j = z - (q_j^T z) q_j$ computed by the Lanczos algorithm
is constructed to be orthogonal to $q_j$, so it is also true that $q_{j+1}$ and
$q_j$ are orthogonal to nearly full machine precision. Thus
$q_{j+1}^T q_j = (Q^T Q)_{j+1,j} = O(\varepsilon)$; we will simply assume
$(Q^T Q)_{j+1,j} = 0$. Now write $Q^T Q = I + C + C^T$, where $C$ is lower
triangular. Because of our assumptions about roundoff, $C$ is in fact nonzero
only on the second subdiagonal and below. This means

$$ Q^T Q T - T Q^T Q = (CT - TC) + (C^T T - T C^T), $$

where we can use the zero structures of $C$ and $T$ to easily show that
$CT - TC$ is strictly lower triangular and $C^T T - T C^T$ is strictly upper
triangular. Also, since $e$ is nonzero only in its last entry, $e q^T Q$ is
nonzero only in the last row. Furthermore, the structure of $Q^T Q$ just
described implies that the last entry of the last row of $e q^T Q$ is zero. So
in particular, $e q^T Q$ is also strictly lower triangular and $Q^T q e^T$ is
strictly upper triangular. Applying the fact that $e q^T Q$ and $CT - TC$ are
both strictly lower triangular to equation (7.4) yields

$$ 0 = (CT - TC) - \beta e q^T Q + L, \qquad (7.6) $$

where $L$ is the strict lower triangle of $Q^T F - F^T Q$. Multiplying equation
(7.6) on the left by $v^T$ and on the right by $v$, using equation (7.5) and
the fact that $v^T (CT - TC) v = v^T C v \theta - \theta v^T C v = 0$, yields

$$ v^T \beta (e q^T Q) v = [\beta v(k)] \cdot [q^T (Qv)] = v^T L v. $$

Since $|v^T L v| \le \|L\| = O(\|Q^T F - F^T Q\|) = O(\|F\|) = O(\varepsilon \|A\|)$,
we get

$$ [\beta v(k)] \cdot [q^T (Qv)] = O(\varepsilon \|A\|), $$

which is equivalent to Paige's theorem.
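
The zero-structure argument used in the proof can be checked numerically on a
small case. The sketch below is not from the text: the size `k = 6` and the
entries of `T` and `C` are hypothetical values chosen only so that `T` is
symmetric tridiagonal and `C` is nonzero on the second subdiagonal and below.
It confirms that $CT - TC$ then has no entries on or above the diagonal.

```python
def matmul(X, Y):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][l] * Y[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

k = 6
# Hypothetical data: T is symmetric tridiagonal.
T = [[0.0] * k for _ in range(k)]
for i in range(k):
    T[i][i] = 2.0 + i
    if i + 1 < k:
        T[i][i + 1] = T[i + 1][i] = 1.0 + 0.5 * i

# Hypothetical data: C is nonzero only where i >= j + 2.
C = [[(1.0 + i + 2 * j) * 1e-3 if i >= j + 2 else 0.0 for j in range(k)]
     for i in range(k)]

CT, TC = matmul(C, T), matmul(T, C)
D = [[CT[i][j] - TC[i][j] for j in range(k)] for i in range(k)]

# Strictly lower triangular: every entry on or above the diagonal vanishes.
upper_part = [D[i][j] for i in range(k) for j in range(i, k)]
print(all(x == 0.0 for x in upper_part))
```

If the first subdiagonal of `C` were allowed to be nonzero, the diagonal of
$CT - TC$ would generally not vanish, which is why the proof assumes
$(Q^T Q)_{j+1,j} = 0$.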


7.5. The Lanczos Algorithm with Selective Orthogonalization


We discuss a variation of the Lanczos algorithm which has (nearly) the high
accuracy of the Lanczos algorithm with full reorthogonalization but (nearly)
the low cost of the Lanczos algorithm with no reorthogonalization. This
algorithm is called the Lanczos algorithm with selective orthogonalization. As
discussed in the last section, our goal is to keep the computed Lanczos vectors
$q_k$ as nearly orthogonal as possible (for high accuracy) by orthogonalizing
them against as few other vectors as possible at each step (for low cost).
Paige's theorem (Theorem 7.3 in the last section) tells us that the $q_k$ lose
orthogonality because they acquire large components in the direction of Ritz
vectors $y_{i,k} = Q_k v_i$ whose Ritz values $\theta_i$ have converged, as
measured by the error bound $\beta_k |v_i(k)|$ becoming small. This phenomenon
was illustrated in Example 7.2.

Fig. 7.8. Lanczos with no reorthogonalization applied to A. The first 149 steps
are shown for the largest eigenvalue (in black, at top) and for the second
largest eigenvalue (in red, at bottom). The dashed lines are error bounds as
before. The lines marked by x's and o's show $y_{k,i}^T q_{k+1}$, the component
of Lanczos vector $k+1$ in the direction of the Ritz vector for the largest
Ritz value ($i = 1$, at top) or for the second largest Ritz value ($i = 2$, at
bottom).

Thus, the simplest version of selective orthogonalization simply monitors
the error bound $\beta_k |v_i(k)|$ at each step, and when it becomes small
enough, the vector $z$ in the inner loop of the Lanczos algorithm is
orthogonalized against $y_{i,k}$: $z = z - (y_{i,k}^T z)\, y_{i,k}$. We
consider $\beta_k |v_i(k)|$ to be small when it is less than
$\sqrt{\varepsilon}\,\|A\|$, since Paige's theorem tells us that the vector
component $|y_{i,k}^T q_{k+1}| = |y_{i,k}^T z / \|z\|_2|$ is then likely to
exceed $\sqrt{\varepsilon}$. (In practice we may replace $\|A\|$ by $\|T_k\|$,
since $\|T_k\|$ is known and $\|A\|$ may not be.) This leads to the following
algorithm.


ALGORITHM 7.3. The Lanczos algorithm with selective orthogonalization for
finding eigenvalues and eigenvectors of $A = A^T$:

    $q_1 = b / \|b\|_2$, $\beta_0 = 0$, $q_0 = 0$
    for $j = 1$ to $k$
        $z = A q_j$
        $\alpha_j = q_j^T z$
        $z = z - \alpha_j q_j - \beta_{j-1} q_{j-1}$
        /* Selectively orthogonalize against converged Ritz vectors */
        for all $i \le k$ such that $\beta_k |v_i(k)| \le \sqrt{\varepsilon}\,\|T_k\|$
            $z = z - (y_{i,k}^T z)\, y_{i,k}$
        end for
        $\beta_j = \|z\|_2$
        if $\beta_j = 0$, quit
        $q_{j+1} = z / \beta_j$
        Compute eigenvalues, eigenvectors, and error bounds of $T_k$
    end for
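
The algorithm above can be sketched in plain Python as follows. This is a
minimal illustration under stated assumptions, not the book's Matlab code: the
helper `jacobi_eigh` (a cyclic Jacobi eigensolver standing in for a proper
tridiagonal eigensolver), the choice `eps = 2**-53`, and the small diagonal
test matrix are all hypothetical. The sketch also assumes that the Ritz pairs
used for selection at step $j$ are those computed at the end of step $j-1$,
which is one reading of the loop as written.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def jacobi_eigh(T, max_sweeps=50):
    """Eigenvalues and eigenvectors (columns of V) of a small symmetric matrix
    by cyclic Jacobi rotations; a stand-in for a tridiagonal eigensolver."""
    m = len(T)
    A = [row[:] for row in T]
    V = [[float(i == j) for j in range(m)] for i in range(m)]
    scale = sum(x * x for row in A for x in row) or 1.0
    for _ in range(max_sweeps):
        off = sum(A[i][j] ** 2 for i in range(m) for j in range(m) if i != j)
        if off <= 1e-30 * scale:
            break
        for p in range(m - 1):
            for q in range(p + 1, m):
                if A[p][q] == 0.0:
                    continue
                tau = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
                c = 1.0 / math.hypot(1.0, t)
                s = c * t
                for M in (A, V):              # rotate columns p, q of A and V
                    for r in range(m):
                        mp, mq = M[r][p], M[r][q]
                        M[r][p], M[r][q] = c * mp - s * mq, s * mp + c * mq
                for r in range(m):            # rotate rows p, q of A
                    ap, aq = A[p][r], A[q][r]
                    A[p][r], A[q][r] = c * ap - s * aq, s * ap + c * aq
                A[p][q] = A[q][p] = 0.0
    return [A[i][i] for i in range(m)], V

def lanczos_selective(apply_A, b, k, eps=2.0 ** -53):
    """Sketch of Algorithm 7.3. Returns the last Ritz values, the Lanczos
    vectors, and the number of selective orthogonalizations performed."""
    n = len(b)
    nb = math.sqrt(dot(b, b))
    Q = [[x / nb for x in b]]                 # Q[j] holds q_{j+1}
    alpha, beta, theta = [], [], []
    ritz, normT, count = [], 0.0, 0           # ritz: (error bound, Ritz vector)
    for j in range(k):
        z = apply_A(Q[j])
        alpha.append(dot(Q[j], z))
        z = [zi - alpha[j] * qi for zi, qi in zip(z, Q[j])]
        if j > 0:
            z = [zi - beta[j - 1] * qi for zi, qi in zip(z, Q[j - 1])]
        # Selectively orthogonalize against converged Ritz vectors.
        for bound, y in ritz:
            if bound <= math.sqrt(eps) * normT:
                coef = dot(y, z)
                z = [zi - coef * yi for zi, yi in zip(z, y)]
                count += 1
        beta.append(math.sqrt(dot(z, z)))
        if beta[j] == 0.0:
            break
        Q.append([zi / beta[j] for zi in z])
        # Eigenvalues, eigenvectors, and error bounds of the current T.
        m = j + 1
        T = [[0.0] * m for _ in range(m)]
        for i in range(m):
            T[i][i] = alpha[i]
            if i + 1 < m:
                T[i][i + 1] = T[i + 1][i] = beta[i]
        theta, V = jacobi_eigh(T)
        normT = max(abs(t) for t in theta)
        ritz = [(beta[j] * abs(V[m - 1][i]),
                 [sum(V[r][i] * Q[r][c] for r in range(m)) for c in range(n)])
                for i in range(m)]
    return theta, Q, count

# Demo on a hypothetical diagonal matrix with one well-separated eigenvalue.
diag = [float(i) for i in range(1, 40)] + [100.0]
theta, Q, count = lanczos_selective(
    lambda x: [d * xi for d, xi in zip(diag, x)], [1.0] * len(diag), 20)
worst = max(abs(dot(Q[i], Q[j]) - (i == j))
            for i in range(len(Q)) for j in range(len(Q)))
print(max(theta), count, worst)
```

The printed `worst` value is the largest deviation of $Q^T Q$ from the
identity, so it gives a direct measure of how well orthogonality was kept. A
production code would use a dedicated symmetric tridiagonal eigensolver and
would form only the Ritz vectors that are actually selected.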


The following example shows what will happen to our earlier 1000-by- 
1000 diagonal matrix when this algorithm is used (HOMEPAGE/Matlab/ 
LanczosSelectiveOrthog.m). 


EXAMPLE 7.3. The behavior of the Lanczos algorithm with selective
orthogonalization is visually indistinguishable from the behavior of the
Lanczos algorithm with full orthogonalization shown in the three graphs on the
right of Figure 7.7. In other words, selective orthogonalization provided as
much accuracy as full orthogonalization.

The smallest singular values of all the $Q_k$ were greater than $1 - 10^{-8}$,
which means that selective orthogonalization did keep the Lanczos vectors
orthogonal to about half precision, as desired.

Figure 7.9 shows the Ritz values of the Ritz vectors selected for reorthogo- 
nalization. Since the selected Ritz vectors correspond to converged Ritz values 
and the largest and smallest Ritz values converge first, there are two graphs: 
the large converged Ritz values are at the top, and the small converged Ritz 
values are at the bottom. The top graph matches the Ritz values shown in 
the upper right graph in Figure 7.7 that have converged to at least half preci- 
sion. All together, 1485 Ritz vectors were selected for orthogonalization of a 
total possible $149 \cdot 150 / 2 = 11175$. Thus, selective orthogonalization
did only $1485/11175 \approx 13\%$ as much work to keep the Lanczos vectors
(nearly) orthogonal as full reorthogonalization.
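
The work comparison is simple arithmetic, reproduced here as a quick check. The
variable names are hypothetical; the numbers 149 and 1485 are the ones quoted
above, and the count assumes that step $j$ of full reorthogonalization involves
$j$ vectors.

```python
steps = 149
selected = 1485                               # Ritz vectors actually selected

# Summing j vectors over steps j = 1, ..., 149.
total_possible = steps * (steps + 1) // 2
fraction = selected / total_possible

print(total_possible, round(100 * fraction, 1))
```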

Figure 7.10 shows how the Lanczos algorithm with selective reorthogonalization
keeps the Lanczos vectors orthogonal just to the Ritz vectors for the largest
two Ritz values. The graph at the top is a superposition of the two graphs in
Figure 7.8, which show the error bounds and Ritz vector components for the
Lanczos algorithm with no reorthogonalization. The graph at the bottom is the
corresponding graph for the Lanczos algorithm with selective orthogonalization.
Note that at step $k = 50$, the error bound for the largest eigenvalue (the
dashed black line) has reached the threshold of $\sqrt{\varepsilon}$. The Ritz
vector is selected for orthogonalization (as shown by the black pluses at the
top of Figure 7.9), and the component in this Ritz vector direction disappears
from the bottom graph of Figure 7.10. A few steps later, at $k = 58$, the error
bound for the second largest Ritz value reaches $\sqrt{\varepsilon}$, and it
too is selected for orthogonalization. The error bounds in the top graph
continue to decrease to machine epsilon $\varepsilon$ and stay there, whereas
the error bounds in the bottom graph eventually grow again. $\diamond$


7.6. Beyond Selective Orthogonalization 


Selective orthogonalization is not the end of the story, because the symmetric 
Lanczos algorithm can be made even less expensive. It turns out that once a 
Lanczos vector has been orthogonalized against a particular Ritz vector y, it 
takes many steps before the Lanczos vector again requires orthogonalization 
against y. So much of the orthogonalization work in Algorithm 7.3 can be 
eliminated. Indeed, there is a simple and inexpensive recurrence for deciding 
when to reorthogonalize [222, 190]. Another enhancement is to use the error 
bounds to efficiently distinguish between converged and “misconverged” eigen- 
values [196]. A state-of-the-art implementation of the Lanczos algorithm is de- 
scribed in [123]. A different software implementation is available in ARPACK 
(NETLIB/scalapack/readme.arpack [169, 231]). 

Fig. 7.9. The Lanczos algorithm with selective orthogonalization applied to A.
The Ritz values whose Ritz vectors are selected for orthogonalization are
shown.

If we apply Lanczos to the shifted and inverted matrix $(A - \sigma I)^{-1}$,
then we expect the eigenvalues closest to $\sigma$ to converge first. There are
other methods to "precondition" a matrix A to converge to certain eigenvalues
more quickly. For example, Davidson's method [59] is used in quantum chemistry
problems, where A is strongly diagonally dominant. It is also possible to
combine Davidson's method with Jacobi's method [227].
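
The reason shift-and-invert favors eigenvalues near $\sigma$ can be written out
in one line. The numerical values below are a hypothetical illustration, not
taken from the text.

```latex
\[
A x = \lambda x
\quad\Longrightarrow\quad
(A - \sigma I)^{-1} x = \frac{1}{\lambda - \sigma}\, x ,
\]
\[
\text{e.g. } \lambda \in \{1,\ 2,\ 10\},\ \sigma = 1.9
\quad\Longrightarrow\quad
\frac{1}{\lambda - \sigma} \approx \{-1.11,\ 10,\ 0.123\}.
\]
```

The eigenvalue nearest the shift ($\lambda = 2$) becomes the extreme, well
separated eigenvalue of the transformed matrix, and extreme eigenvalues are the
ones Lanczos finds first. The eigenvectors are unchanged.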


7.7. Iterative Algorithms for the Nonsymmetric Eigenproblem


When A is nonsymmetric, the Lanczos algorithm described above is no longer 
applicable. There are two alternatives. 

Fig. 7.10. The Lanczos algorithm with selective orthogonalization applied to A.
The top graph shows the first 149 steps of the Lanczos algorithm with no
reorthogonalization, and the bottom graph shows the Lanczos algorithm with
selective orthogonalization. The largest eigenvalue is shown in black, and the
second largest eigenvalue is shown in red. The dashed lines are error bounds as
before. The lines marked by x's and o's show $y_{k,i}^T q_{k+1}$, the component
of Lanczos vector $k+1$ in the direction of the Ritz vector for the largest
Ritz value ($i = 1$, in black) or for the second largest Ritz value ($i = 2$,
in red). Note that selective orthogonalization eliminates these components
after the first selective orthogonalizations at steps 50 ($i = 1$) and 58
($i = 2$).

The first alternative is to use the Arnoldi algorithm (Algorithm 6.9). Recall
that the Arnoldi algorithm computes an orthogonal basis $Q_k$ of a Krylov
subspace $\mathcal{K}_k(q_1, A)$ such that $Q_k^T A Q_k = H_k$ is upper
Hessenberg rather than symmetric tridiagonal. The Rayleigh-Ritz procedure is
again to approximate the eigenvalues of A by the eigenvalues of $H_k$. Since A
is nonsymmetric, its eigenvalues may be complex and/or badly conditioned, so
many of the attractive error bounds and monotonic convergence properties
enjoyed by the Lanczos algorithm and described in section 7.3 no longer hold.
Nonetheless, effective algorithms and implementations exist. Good references
include [152, 169, 210, 214, 215, 231] and the book [211]. The latest software
is described in [169, 231] and may be found in NETLIB/scalapack/readme.arpack.
The Matlab command speig (for "sparse eigenvalues") uses this software.
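
The relation $Q_k^T A Q_k = H_k$ recalled above can be checked on a tiny case.
The following is a minimal sketch, assuming a modified Gram-Schmidt form of the
Arnoldi process; it is not a transcription of Algorithm 6.9, and the 4-by-4
matrix and starting vector are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def arnoldi(A, b, k):
    """k steps of Arnoldi with modified Gram-Schmidt.
    Returns the basis vectors (as a list) and the k-by-k matrix H."""
    nb = math.sqrt(dot(b, b))
    Q = [[x / nb for x in b]]
    H = [[0.0] * k for _ in range(k)]
    for j in range(k):
        z = [dot(row, Q[j]) for row in A]           # z = A q_j
        for i in range(j + 1):                      # orthogonalize against q_1..q_j
            H[i][j] = dot(Q[i], z)
            z = [zi - H[i][j] * qi for zi, qi in zip(z, Q[i])]
        h = math.sqrt(dot(z, z))
        if j + 1 == k or h == 0.0:                  # done, or exact breakdown
            break
        H[j + 1][j] = h
        Q.append([zi / h for zi in z])
    return Q, H

# Hypothetical nonsymmetric test matrix.
A = [[4.0, 1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0, 0.0],
     [1.0, 0.0, 2.0, 1.0],
     [0.0, 2.0, 0.0, 1.0]]
Q, H = arnoldi(A, [1.0, 0.0, 0.0, 0.0], 3)

AQ = [[dot(row, q) for row in A] for q in Q]        # AQ[j] = A q_j
G = [[dot(Q[i], AQ[j]) for j in range(3)] for i in range(3)]   # Q^T A Q
err = max(abs(G[i][j] - H[i][j]) for i in range(3) for j in range(3))
print(err < 1e-12, H[2][0])
```

The eigenvalues of `H` would then serve as the Ritz values. Computing them
needs a nonsymmetric eigensolver, which is left out of this sketch.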

A second alternative is to use the nonsymmetric Lanczos algorithm. This
algorithm attempts to reduce A to nonsymmetric tridiagonal form by a
nonorthogonal similarity. The hope is that it will be easier to find the
eigenvalues of a (sparse!) nonsymmetric tridiagonal matrix than the Hessenberg
matrix produced by the Arnoldi algorithm. Unfortunately, the similarity
transformations can be quite ill-conditioned, which means that the eigenvalues
of the tridiagonal and of the original matrix may greatly differ. In fact, it
is not always possible to find an appropriate similarity because of a
phenomenon known as "breakdown" [41, 132, 133, 197]. Attempts to repair
breakdown by a process called "lookahead" have been proposed, implemented, and
analyzed in [16, 18, 54, 55, 63, 106, 200, 263, 264].

Finally, it is possible to apply subspace iteration (Algorithm 4.3) [19], 
Davidson’s algorithm [214], or the Jacobi-Davidson algorithm [228] to the 
sparse nonsymmetric eigenproblem. 


7.8. References and Other Topics for Chapter 7 


In addition to the references in sections 7.6 and 7.7, there are a number of good 
surveys available on algorithms for sparse eigenvalue problems: see [17, 50,
123, 161, 195, 211, 260]. Parallel implementations are also discussed in [75]. 

In section 6.2 we discussed the existence of on-line help to choose from 
among the variety of iterative methods available for solving Ax = b. A similar 
project is underway for eigenproblems and will be incorporated in a future 
edition of this book. 


7.9. Questions for Chapter 7 


QUESTION 7.1. (Easy) Confirm that running the Arnoldi algorithm (Algorithm 6.9)
or the Lanczos algorithm (Algorithm 6.10) on A with starting vector $q$ yields
the identical tridiagonal matrices $T_k$ (or Hessenberg matrices $H_k$) as
running on $Q^T A Q$ with starting vector $Q^T q$.


QUESTION 7.2. (Medium) Let $\lambda_i$ be a simple eigenvalue of A. Confirm
that if $q_1$ is orthogonal to the corresponding eigenvector of A, then the
eigenvalues of the tridiagonal matrices $T_k$ computed by the Lanczos algorithm
in exact arithmetic cannot converge to $\lambda_i$ in the sense that the
largest $T_k$ computed cannot have $\lambda_i$ as an eigenvalue. Show by means
of a 3-by-3 example that an eigenvalue of some other $T_k$ can equal
$\lambda_i$ "accidentally."


QUESTION 7.3. (Medium) Confirm that no symmetric tridiagonal matrix Tk 
computed by the Lanczos algorithm can have an exactly multiple eigenvalue. 
Show that if A has a multiple eigenvalue, then Lanczos applied to A must 
break down before the last step. 
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Lanczos algorithm, 306, 321 
linear equations, 44, 49 
normal equations, 118 
orthogonal transformations, 124 
polynomial evaluation, 16 
QR decomposition, 118, 119, 123 
secular equation, 224 
single precision iterative refine- 
ment, 60 
Strassen’s method, 69, 86 
substitution, 26 
SVD, 118, 119, 123, 128 
band matrices 
linear equations, 73, 76-79, 81, 
82 
symmetric eigenproblem, 185 
Bauer—Fike theorem, 150 
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biconjugate gradients, 321 
bidiagonal form, 131, 240, 308, 357 


condition number, 87 

dqds algorithm, 242 

LR iteration, 242 

perturbation theory, 207, 242, 
245, 246, 263 

qds algorithm, 242 

QR iteration, 242 

reduction, 166, 240, 253 

SVD, 246, 260 


bisection 


finding zeros of polynomials, 7, 
30 

SVD, 241, 242, 247, 249 

symmetric eigenproblem, 119, 201, 
210, 211, 228, 236, 241, 260 


BLAS (Basic Linear Algebra Sub- 


routines), 28, 64-72, 83, 86 

in Cholesky, 76, 91 

in Hessenberg reduction, 165 

in Householder transformations, 
137 

in nonsymmetric eigenproblem, 
184, 185 

in QR decomposition, 121 

in sparse Gaussian elimination, 


84 


block algorithms 


Cholesky, 64, 76, 91 

Gaussian elimination, 70-72 

Hessenberg reduction, 165 

Householder reflection, 137 

matrix multiplication, 66 

nonsymmetric eigenproblem, 184, 
185 

QR decomposition, 121, 137 

sparse Gaussian elimination, 83 
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block cyclic reduction, 266, 328-331, 
333, 357 
model problem, 277 
boundary value problem 
one-dimensional heat equation, 
77 
Dirichlet, 267 
eigenproblem, 270 
L-shaped region, 349 
Poisson’s equation, 267, 325, 349 
Toda lattice, 255 
bulge chasing, 169, 171, 213 


canonical form, 139, 140, 145 
generalized Schur for real regu- 
lar pencils, 179, 185 
generalized Schur for regular pen- 
cils, 178, 180, 185 
generalized Schur for singular 
pencils, 180, 185 
Jordan, 3, 19, 140, 141, 144- 
146, 150, 175, 176, 178, 180, 
184, 185, 188, 280 
Kronecker, iv, 179-182, 185, 187 
polynomial, 19 
real Schur, 147, 163, 184, 213 
Schur, 3, 140, 146-148, 152, 158, 
160, 163, 175, 178, 180, 184, 
185, 187, 188 
Weierstrass, iv, 173, 175, 176, 
178, 180, 181, 185, 187 
CAPSS, 84 
Cauchy interlace theorem, 261, 369 
Cauchy matrices, 85 
Cayley transform, 264 
Cayley—Hamilton theorem, 296 
CG, see conjugate gradients 306 
CGS, see Gram-Schmidt orthog- 
onalization process (classi- 


cal); conjugate gradients squared 


characteristic polynomial, 140, 149, 
296 
companion matrix, 302 
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of A — AB, 173 
of Rsorw); 290 
of a matrix polynomial, 182 
secular equation, 218, 225, 231 
Chebyshev acceleration, 279, 294— 
300, 331 
model problem, 277 
Chebyshev polynomial, 296, 314, 330, 
357, 358 
Cholesky, 2, 74-76, 253 
band, 2, 77, 78, 277 
block algorithm, 64, 91 
condition number, 88 
conjugate gradients, 308 
definite pencils, 179 
incomplete (as preconditioner), 
319 
LINPACK, 62 
LR iteration, 243, 263 
mass spring system, 179 
model problem, 277 
normal equations, 107 
of Ty, 270, 357 
on a Cray YMP, 62 
sparse, 80, 81, 277 
symmetric eigenproblem, 253, 263 
tridiagonal, 78, 331 
CLAPACK, 61, 86, 88 
companion matrix, 183, 302 
block, 183 
computational geometry, 139, 175, 
184, 187, 191 
condition number, 2, 4, 5 
convergence of iterative meth- 
ods, 285, 312, 314, 317, 320, 
351 
distance to ill-posedness, 17, 19, 
23, 33, 86, 152 
equilibration, 61 
estimation, 50 
infinite, 17, 148 
iterative refinement of linear sys- 
tems, 58 
least squares, 101, 102, 106, 108, 


Index 


117, 125, 126, 128, 129, 134 
linear equations, 32-38, 46, 50, 
87, 89, 105, 124, 132, 146 
nonsymmetric eigenproblem, 32, 
148-153, 189 
Poisson’s equation, 269 
polynomial evaluation, 15-17, 24 
polynomial roots, 28, 29 
preconditioning, 317 
rank-deficient least squares, 101, 
125, 126, 128, 129 
relative, for Ax = b, 35, 54, 60 
symmetric eigenproblem, 197 
conjugate gradients, 266, 278, 301, 
306-319, 351 
convergence, 306, 312, 352 
model problem, 277 
preconditioning, 317, 351, 354 
conjugate gradients squared, 321 
conjugate gradients stabilized, 321 
conjugate transpose, 1 
conservation law, 255, 256 
consistent ordering, 293 
controllable subspace, 182, 187 
convolution, 324, 326 
Courant—Fischer minimax theorem, 
198, 199, 201, 261 
Cray, 13, 14 
2, 226 
C90/J90, 13, 62, 82, 226 
extended precision, 27 
roundoff error, 12, 25, 26, 224, 
226 
square root, 26 
T3 series, 12, 62, 82 
YMP, 62, 64 


DAEs, see differential algebraic equa- 


tions 
DEC 
symmetric multiprocessor, 62, 
82 


workstations, 10, 12, 14 
deflation, 221 
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during QR iteration, 214 
in secular equation, 221, 237, 
262 
diagonal dominance, 91, 388 
convergence of Jacobi and Gauss- 
Seidel, 286-294 
weak, 289 
differential algebraic equations, 175, 
178, 185 
divide-and-conquer, 13, 195, 211, 212, 
217-228, 231, 236 
SVD, 133, 241, 242 
domain decomposition, 266, 285, 317, 
319, 348-357, 361 
dqds algorithm, 195, 243 


eigenvalue, 140 
generalized nonsymmetric eigen- 
problem, 173 
algorithms, 173-184 
nonsymmetric eigenproblem 
algorithms, 153-173, 184 
perturbation theory, 148-153 
symmetric eigenproblem 
algorithms, 210-237 
perturbation theory, 197—210 
eigenvector, 140 
generalized nonsymmetric eigen- 
problem, 174 
algorithms, 173-184 
nonsymmetric eigenproblem 
algorithms, 153-173, 184 
of Schur form, 148 
symmetric eigenproblem 
algorithms, 210-237 
perturbation theory, 197—210 
EISPACK, 62 
equilibration, 37, 61 
equivalence transformation, 175 


fast Fourier transform, 266, 278, 319, 
321-328, 333, 348, 351, 352, 
357, 359-361 
model problem, 277 
FFT, see fast Fourier transform 


floating point arithmetic, 2, 5, 9, 23 


oo, 12, 28, 231 

complex numbers, 11, 26 

cost of comparison, 50 

cost of division, square root, 245 

cost versus memory operations, 
63 

Cray, 13, 26, 27, 226 

exception handling, 12, 28, 231 

extended precision, 14, 27, 45, 
60, 224 

IEEE standard, 10, 241 

interval arithmetic, 14, 45 

Lanczos algorithm, 377 

machine epsilon, machine pre- 
cision, macheps, 11 

NaN (Not a Number), 12 

normalized numbers, 9 

overflow, 11 

roundoff error, 11 

subnormal numbers, 11 

underflow, 11 


flops, 5 


Gauss-Seidel, 266, 278, 279, 282- 


283, 285-294, 357 
in domain decomposition, 355 
model problem, 277 


Gaussian elimination, 31, 38-44 


band matrices, 76-79 

block algorithm, 31, 61-73 

error bounds, 31, 44-58 

GECP, 46, 49, 55, 56, 88 

GEPP, 46, 49, 55, 56, 87, 88, 
132 

iterative refinement, 31, 58-61 

pivoting, 45 

sparse matrices, 79-83 

symmetric matrices, 76 

symmetric positive definite ma- 


Index 


in GMRES, 321 
in Jacobi’s method, 233, 251 
in QR decomposition, 121, 135 
in QR iteration, 167, 168 
GMRES, 306, 320 
restarted, 321 
Gram-Schmidt orthogonalization pro- 
cess, 107, 377 
Arnoldi’s algorithm, 304, 320 
classical, 107, 119, 134 
modified, 107, 119, 134, 231 
QR decomposition, 107, 119 
stability, 108, 118, 134 
graph 
bipartite, 286, 291 
directed, 288 
strongly connected, 289 
guptri (generalized upper triangu- 
lar form), 186 


Hessenberg form, 163, 184, 213, 302, 
360 
double shift QR iteration, 170, 
172 
implicit Q theorem, 167 
in Arnoldi’s algorithm, 303, 304, 
388, 389 
in GMRES, 320 
QR iteration, 165, 167-173, 183 
reduction, 164-166, 212, 303, 389 
single shift QR iteration, 168 
unreduced, 166 
Hilbert matrix, 85 
Householder reflection, 119-123, 135 
block algorithm, 133, 137, 165 
error analysis, 123 
in bidiagonal reduction, 166, 252 
in double shift QR iteration, 170 
in Hessenberg reduction, 212 
in QR decomposition, 119, 134, 


trices, 74-76 135, 157 
Gershgorin’s theorem, 79, 91, 150 in QR decomposition with piv- 
Givens rotation, 119, 121-123 oting, 132 


error analysis, 123 in tridiagonal reduction, 213 


Index 


HP workstations, 10 


IBM 
370, 9 
RS6000, 5, 14, 27, 68, 95, 133, 
184, 236 
SP-2, 62, 82 
workstations, 10 
ill-posedness, 17, 23, 33, 34, 86, 148 
implicit Q theorem, 167 
impulse response, 178 
incomplete Cholesky, 319 
incomplete LU decomposition, 319 
inertia, 202, 208, 228, 247 
Intel 
8086/8087, 14 
Paragon, 62, 73, 82 
Pentium, 14, 60 
invariant subspace, 145, 147, 153, 
154, 156-158, 189, 207 
inverse iteration, 155, 162 
SVD, 242 
symmetric eigenproblem, 119, 211 
215, 228-232, 236, 237, 241, 
260, 363 
inverse power method, see inverse 
iteration 
irreducibility, 286, 288-290 
iterative methods 
for Ax = Ax, 363-389 
for Ax = b, 265-361 
convergence rate, 281 
splitting, 279 


Jacobi’s method (for Ax = Ax), 195, 
210, 212, 232-237, 260, 263 
Jacobi’s method (for Ax = b), 278, 
279, 281-282, 285-294, 357 
in domain decomposition, 355 
model problem, 277 
Jacobi’s method (for the SVD), 242, 
249-254, 263 
Jordan canonical form, 3, 19, 140, 
141, 144-146, 150, 175, 176, 
178, 180, 184, 185, 188, 280 
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instability, 146, 178 
solving differential equations, 176 


Korteweg-de Vries equation, 260 

Kronecker Canonical Form, iv, 182 

Kronecker canonical form, 179-181, 
185, 187 

solving differential equations, 181 

Kronecker product, 274, 358 

Krylov subspace, 266, 278, 300-321, 
351, 354, 360, 363-389 


Lanczos algorithm, 119, 304-306, 308, 
310, 320, 360, 364-389 
nonsymmetric, 321, 388 
LAPACK, 6, 61, 62, 86, 87, 153 
dlamch, 13 
sbdsdc, 242 
sbdsqr, 242, 243 
sgebrd, 167 
sgeequ, 61 
sgees(x), 153 
sgeesx, 185 
sgees, 185 
sgeev(x), 153 
sgeevx, 185 
sgeev, 185 
sgehrd, 165 
sgelqf, 132 
sgelss, 133 
sgels, 121 
sgeqlf, 132 
sgeqpf, 132, 133 
sgeqrf, 137 
sgerfs, 61 
sgerqf, 132 
sgesvx, 35, 53, 54, 57, 61, 88 
sgesv, 88 
sgetf2, 72, 88 
sgetrf, 72, 88 
sggesx, 185 
sgges, 179, 185 
sggevx, 185 
sggev, 185 
sgglse, 138 


slacon, 53 SVD, 105, 109-117 

slaed3, 227 underdetermined, 2, 101, 136 
slaed4, 222, 224 weighted, 135 

slahqr, 164 linear equations 

slamch, 13 Arnoldi’s method, 320 

slatms, 90 band matrices, 73, 76-79, 81, 
spotrf, 76 82 

sptsv, 79 block algorithm, 61-73 

ssbsv, 77 block cyclic reduction, 328-331 
sspsv, 77 Cauchy matrices, 85 

sstebz, 231, 237 Chebyshev acceleration, 279, 294— 
sstein, 231 300 

ssteqr, 214 Cholesky, 74-76, 277 

ssterf, 214 condition estimation, 50 
sstevd, 211, 217 condition number, 32-38 
sstev, 211 conjugate gradients, 307-321 
ssyevd, 217, 236 direct methods, 31—92 

ssyevx, 212 distance to ill-posedness, 33 


ssyev, 211, 214 
ssygv, 179, 185 
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domain decomposition, 319, 348- 
357 


ssysv, 76 error bounds, 44—58 

ssytrd, 166 fast Fourier transform, 321-328 
strevc, 148 FFT, see fast Fourier trans- 
strsen, 153 form 

strsna, 153 Gauss-Seidel, 279, 282—294 


LAPACK+4, 61 Gaussian elimination, 38—44 
LAPACK90, 61 with complete pivoting (GECP), 
Laplace’s equation, 265 41, 50 

least squares, 101-138 with partial pivoting (GEPP), 


condition number, 117-118, 125, 
126, 128, 134 
in GMRES, 321 
normal equations, 105-107 
overdetermined, 2, 101 
performance, 132-133 
perturbation theory, 117-118 
pseudoinverse, 127 
QR decomposition, 105, 107-109, 
114, 121 
rank-deficient, 125-132 
failure to recognize, 132 
pseudoinverse, 127 
roundoff error, 123-124 
software, 121 


41, 49, 87 
iterative methods, 265-361 
iterative refinement, 58-61 
Jacobi’s method (for Ax = b), 
279, 281-282, 285-294 
Krylov subspace methods, 300- 
321 
LAPACK, 88 
multigrid, 331-348 
perturbation theory, 32-38 
pivoting, 44 
relative condition number, 35- 
38 
relative perturbation theory, 35- 
38 


Index 


sparse Cholesky, 79-83 
sparse Gaussian elimination, 79- 
83 
sparse matrices, 79-83 
SSOR, see symmetric succes- 
sive overrelaxation 
successive overrelaxation, 279, 
283-294 
symmetric matrices, 76 
symmetric positive definite, 74— 
76 
symmetric successive overrelax- 
ation, 279, 294-300 
Toeplitz matrices, 85 
Vandermonde matrices, 83 
LINPACK, 62, 64 
spofa, 62 
benchmark, 73, 86 
LR iteration, 243, 263 
Lyapunov equation, 188 


machine epsilon, machine precision, 
macheps, 11 
mass matrix, 143, 179, 255 
mass-spring system, 142, 174, 179, 
182, 183, 196, 209, 254 
Matlab, 6, 57 
cond, 53 
eig, 179, 185, 211 
fft, 328 
hess, 165 
pinv, 117 
polyfit, 102 
rcond, 53 
roots, 183 
schur, 185 
speig, 388 
bisect.m, 30 
clown, 114 
eigscat.m, 150, 189 
FFT, 359 
homework, 28, 30, 91, 134, 138, 
189-191, 358-360 
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iterative methods for Ar = b, 


266, 301 

Jacobi’s method for Ax = b, 
282, 358 

Lanczos method for Ax = Az, 
369, 377, 384 


least squares, 121, 129 
massspring.m, 144, 197 
multigrid, 337, 360 
notation, 1, 41, 42, 91, 92, 251, 
327 
pivot.m, 50, 55, 56, 61 
Poisson’s equation, 275, 358, 359 
polyplot.m, 29 
qrplt.m, 161, 190 
QRStability.m, 134 
RankDeficient.m, 129 
RayleighContour.m, 201 
sparse matrices, 82 
matrix pencils, 173 
regular, 173 
singular, 173 
memory hierarchy, 63 
MGS, see Gram-Schmidt orthog- 
onalization process, modi- 
fied 
minimum residual algorithm, 320 
MINRES, see minimum residual al- 
gorithm 
model problem, 265-276, 285-286, 
300, 314, 319, 324, 325, 328, 
331, 348, 360, 361 
diagonal dominance, 288, 290 
irreducibility, 290 
red-black ordering, 291 
strong connectivity, 289 
summary of methods, 277—279 
symmetric positive definite, 291 


Moore-Penrose pseudoinverse, see pseu- 


doinverse 
multigrid, 331-348, 357, 360 
model problem, 277 


NETLIB, 86 
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Newton’s method, 58, 219, 221, 231, 
300 
nonsymmetric eigenproblem, 139 
algorithms, 153-173 
condition number, 148 
eigenvalue, 140 
eigenvector, 140 
equivalence transformation, 175 
generalized, 173-184 
algorithms, 184 
ill-posedness, 148 
invariant subspace, 145 
inverse iteration, 155 
inverse power method, see in- 
verse iteration 
matrix pencils, 173 
nonlinear, 182 
orthogonal iteration, 156 
perturbation theory, 148 
power method, 154 
QR iteration, 160 
regular pencil, 173 
Schur canonical form, 146 
similarity transformation, 141 
simultaneous iteration, see or- 
thogonal iteration 
singular pencil, 173 
software, 153 
subspace iteration, see orthog- 
onal iteration 
Weierstrass canonical form, 175 
normal equations, 105, 106, 118, 135, 
136, 320 
backward stability, 118 
norms, 19 
notation, 1 
null space, 111 


ODEs, see ordinary differential equa- 
tions 
ordinary differential equations, 175, 
177, 184, 185 
impulse response, 178 
overdetermined, 181 


Index 


underdetermined, 181 

with algebraic constraints, 178 
orthogonal iteration, 156 
orthogonal matrices, 22, 74, 118, 126, 

131, 160 

backward stability, 124 

error analysis, 123 

Givens rotation, 119 

Householder reflection, 119 

implicit Q theorem, 168 

in bidiagonal reduction, 167 

in definite pencils, 179 

in generalized real Schur form, 

179 

in Hessenberg reduction, 164 

in orthogonal iteration, 156 

in Schur form, 147 

in symmetric QR iteration, 213 

in Toda flow, 256 

Jacobi rotations, 232 


PARPRE, 319 
PCs, 10 
pencils, see matrix pencils 
perfect shuffle, 240, 263 
perturbation theory, 2, 4, 7, 17 
generalized nonsymmetric eigen- 
problem, 180 
least squares, 101, 117, 125 
linear equations, 31, 32, 44, 49 
nonsymmetric eigenproblem, 79, 
139, 141, 148, 180, 187, 189 
polynomial roots, 28 
rank-deficient least squares, 125 
relative, for Ax = Ax, 195, 198, 
207-210, 212, 242, 245-247, 


249, 260, 263 

relative, for Ax = b, 32, 35-38, 
60 

relative, for SVD, 207-210, 246- 
251 


singular pencils, 180 
symmetric eigenproblem, 195, 197, 
207, 261, 263, 367 


Index 


pivoting, 41 
average pivot growth, 86 
band matrices, 77 
by column in QR decomposi- 
tion, 130 
Cholesky, 76 
Gaussian elimination with com- 
plete pivoting (GECP), 49 
Gaussian elimination with par- 
tial pivoting (GEPP), 49, 
132 
growth factor, 49, 59 
Poisson’s equation, 266-279 
in one dimension, 267—270 
in two dimensions, 270-279 
see also model problem, 265 
polynomial 
characteristic, see characteris- 
tic polynomial 
convolution, 326 
evaluation, 34, 83 
at roots of unity, 326 
backward stability, 16 
condition number, 15, 16, 24 
roundoff error, 15, 46 
with Horner’s rule, 7, 15 
fitting, 101, 138 
interpolation, 83 
at roots of unity, 326 
multiplication, 326 
zero finding 
bisection, 7 
computational geometry, 191 
condition number, 28 
power method, 154 
preconditioning, 317, 352-356, 388 
projection, 189 
pseudoinverse, 117, 127, 136 
pseudospectrum, 190 


qds algorithm, 243 

QMR, see quasi-minimum residu- 
als 

QR algorithm, see QR iteration 
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QR decomposition, 105, 107, 131, 
147 
backward stability, 118, 119 
block algorithm, 137 
column pivoting, 130 
in orthogonal iteration, 157 
in QR flow, 258 
in QR iteration, 163, 170 
rank-revealing, 132, 134 
underdetermined least squares, 
136 
QR iteration, 160, 190, 210 
backward stability, 119 
bidiagonal, 242 
convergence failure, 173 
Hessenberg, 163, 165, 184, 213 
implicit shifts, 167-173 
tridiagonal, 211, 212, 236 
convergence, 214 
QRD, see QR decomposition 
quasi-minimum residuals, 321 
quasi-triangular matrix, 147 


range space, 111 
Rayleigh quotient, 198, 205 

iteration, 211, 215, 262, 364 
Rayleigh—Ritz method, 205, 261, 364 
red-black ordering, 283, 291 
relative perturbation theory 

for Ax = Ax, 207-210 

for Ax = b, 35-38 

for SVD, 207-210, 246-249 
roundoff error, 2, 4, 5, 10, 11, 301 

bisection, 30, 230 

block cyclic reduction, 330 

conjugate gradients (CG), 317 

Cray, 13, 26 

dot product, 25 

Gaussian elimination, 25, 44, 57 

geometric modeling, 193 

in logarithm, 25 

inverse iteration, 231 

iterative refinement, 58 
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Jacobi’s method for Ar = Az, 
253 

Jacobi’s method for the SVD, 
251 

Jordan canonical form, 146 

Lanczos algorithm, 305, 364, 369, 
377, 378, 381 

matrix multiplication, 25 

orthogonal iteration, 157 

orthogonal transformations, 101, 
123 

polynomial evaluation, 15 

polynomial root finding, 30 

QR iteration, 164 

rank-deficient least squares, 125, 
127, 128 

rank-revealing QR decomposi- 
tion, 131 

simulating quadruple precision, 
27 

substitution, forward or back, 
26 

SVD, 241, 247 

symmetric eigenproblem, 191 


ScaLAPACK, 61, 72 
ARPACK, 385 
PARPRE, 319 
Schur canonical form, 3, 140, 146- 
148, 152, 158, 160, 163, 175, 
178, 180, 184, 185 
block diagonalization, 188 
computing eigenvectors, 148 
computing matrix functions, 187 
for real matrices, 147, 163, 184, 
213 
generalized for real regular pen- 
cils, 179, 185 
generalized for regular pencils, 
178, 180, 185 
generalized for singular pencils, 
180, 185 
solving Sylvester or Lyapunov 
equations, 188 


Schur complement, 91, 351 
secular equation, 219 
SGI symmetric multiprocessor, 62, 
82, 84 
shifting, 155 
convergence failure, 173 
exceptional shift, 173 
Francis shift, 172 
in double shift Hessenberg QR 
iteration, 163, 170, 172 
in QR iteration, 161, 172 
in single shift Hessenberg QR 
iteration, 168 
in tridiagonal QR iteration, 213 
Rayleigh quotient shift, 215 
Wilkinson shift, 213 
zero shift, 242 
similarity transformation, 141 
best conditioned, 153, 187 
simultaneous iteration, see orthog- 
onal iteration 
singular value, 109 
algorithms, 237-254 
singular value decomposition, see SVD 


singular vector, 109 

algorithms, 237-254 
SOR, see successive overrelaxation 
sparse matrices 

direct methods for Ax = b, 79-83 

iterative methods for Ax = λx, 
363-389 

iterative methods for Ax = b, 
265-361 


spectral projection, 189 

splitting, 279 

SSOR, see symmetric successive over- 
relaxation 

stiffness matrix, 143, 179, 255 

Strassen’s method, 68 

strong connectivity, 289 

subspace iteration, see orthogonal 
iteration 


substitution (forward or backward), 
3, 38, 44, 48, 86, 177, 188 
error analysis, 26 
successive overrelaxation, 279, 283- 
294, 357 
model problem, 277 
SUN 
symmetric multiprocessor, 62, 
82 
workstations, 10, 14 
SVD, 105, 109-117, 134, 136, 174, 
195 
algorithms, 237-254, 260 
backward stability, 118, 119, 128 
reduction to bidiagonal form, 166, 
240 
relative perturbation theory, 207-210 
underdetermined least squares, 
136 
Sylvester equation AX - XB = C, 
188, 358 
Sylvester’s inertia theorem, 202 
symmetric eigenproblem, 195 
algorithms, 210 
bisection, 211, 260 
condition numbers, 197, 207 
Courant-Fischer minimax theorem, 199, 261 
definite pencil, 179 
divide-and-conquer, 13, 211, 217, 
260 
inverse iteration, 211 
Jacobi’s method, 212, 232, 260 
perturbation theory, 197 
Rayleigh quotient, 198 
Rayleigh quotient iteration, 211, 
215 
relative perturbation theory, 207 
Sylvester’s inertia theorem, 202 
tridiagonal QR iteration, 211, 
212 
symmetric successive overrelaxation, 
279, 294-300 
model problem, 277 
SYMMLQ, 320 


templates for Ax = b, 266, 279, 301 
Toda flow, 256, 260 
Toda lattice, 255 
Toeplitz matrices, 85 
transpose, 1 
tridiagonal form, 119, 166, 179, 232, 
236, 237, 244, 247, 256, 308, 
330 
bisection, 228-232 
block, 293, 359 
divide-and-conquer, 217 
in block cyclic reduction, 330, 
331 
in boundary value problems, 78 
inverse iteration, 228-232 
nonsymmetric, 321 
QR iteration, 211, 212 
reduction, 163, 166, 197, 213, 
236, 253 
using Lanczos, 303, 304, 320, 
321, 366, 389 
relation to bidiagonal form, 240 


unitary matrices, 22 


Vandermonde matrices, 83 
vec(·), 274 


Weierstrass canonical form, 173, 175, 
176, 178, 180, 181, 185, 187 
solving differential equations, 176 
Errors in the book ( HRK) 
"Applied Numerical Linear Algebra" 
by James Demmel 
I will continue to post errors and clarifications that I or others find in this location, as well as the 
source. 


Page 22, Lemma 1.7, part 2: This is imprecise on which norms I mean. There are 3 norms in the 
inequality "||A*B|| <= ||A|| * ||B||", and not every choice of 3 norms makes sense. The easiest case 
is when A and B are square, and you use the same vector norm in the numerator and denominator 
of definition 1.9 for all 3 norms. This is all I wanted you to prove for Question 1.16. (Hyounjin 
Kim) 

The result is more generally true as long as you use the same norm for the vectors in the domain 
space of A*B and B, the same norm for vectors in the range space of B and the domain space of A, 
and the same norm for vectors in the range space of A*B and the range of A. In other words, you 
can choose three different vector norms. 


Page 23, Lemma 1.7, part 7: The notation "lambda_max (A* A)" means "the largest eigenvalue of 
the matrix conjugate-transpose(A) times A". 

Page 23, Lemma 1.7, Part 13: "|| A ||_1 <= || A ||_F" should be "(1/sqrt(n)) * || A ||_1 <= || A ||_F". 
(Hyounjin Kim) 
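
The corrected inequality in Part 13 can be checked numerically. The following is a minimal pure-Python sketch, not taken from the book; the helper names one_norm and frobenius_norm and the test matrix are illustrative assumptions. The matrix has a single column of ones, a case where the uncorrected bound fails and the corrected bound holds with equality.

```python
import math

def one_norm(A):
    """Maximum absolute column sum of a matrix stored as a list of rows."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def frobenius_norm(A):
    """Square root of the sum of the squared entries."""
    return math.sqrt(sum(x * x for row in A for x in row))

n = 4
# Illustrative matrix: first column all ones, zeros elsewhere.
A = [[1.0 if j == 0 else 0.0 for j in range(n)] for _ in range(n)]

lhs_original = one_norm(A)                   # left side as originally printed
lhs_corrected = one_norm(A) / math.sqrt(n)   # left side after the correction
rhs = frobenius_norm(A)

print(lhs_original <= rhs)    # False: the uncorrected bound fails here
print(lhs_corrected <= rhs)   # True: the corrected bound holds
```
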

Page 23, Lemma 1.7, proof: "q^T A^T A q = q^T lambda q" should be "q* A* A q = q* lambda q". 
Page 23, Lemma 1.7, proof: In a denominator of the second displayed equation, "|| Q* x ) ||" 
should be "|| Q* x ||". 

Page 24, Question 1.7: y^H should be y*. Both are acceptable notations for the conjugate 
transpose of y. (Gerardo Lafferriere) 

Page 26, Question 1.16: See the comments on pages 22 and 23 above. 

Page 27, Question 1.18: In the first numbered fact, "s1 - a" and "(s1 - a) - b" should be "a - s1" and 
"(a - s1) + b". (Matt Podolsky) 

Page 29, Question 1.20, part 2: "perturbed eigenvalues" should be "perturbed roots" in the next to 
last line. (Gerardo Lafferriere) 

Page 29, Question 1.20, part 3: p'(r(i)) means the derivative of the polynomial p, evaluated at r(i). 
Page 32, Section 2.2, line 2: "Ax=B" should be "Ax=b". (JD) 

Page 37, Equation (2.8): "|| x ||" should be "|| x hat ||" in the denominator. (Gerardo Lafferriere) 
Page 52, displayed equation near middle: the not-equal sign should be an equal sign. (Rich Vuduc) 
Page 67, Table 2.1: The definition of matrix multiplication should contain "b_{kj}", not "b_{jk}". 
(Rich Vuduc) 

Page 72, line 3 of Algorithm 2.9: U should be n-by-n, not m-by-m. (Maksim Oks) 

Page 73 and 74: Change U_{21} and U_{31} to U_{12} and U_{13}, respectively, in the 
displayed factorizations of the matrix A. 

Page 80, Prop. 2.3: The space needed is "n(bL + bU + 1)", not "bL + bU + 1". (Rich Vuduc) 

Page 95, question 2.10. Assume A is n-by-n, not n-by-m. Also assume A is real, or else replace 
A^T by A^H. Assume M is real symmetric (or complex Hermitian, also replacing L^T by L^H). 
The question about A is correct as stated, but we will not define condition numbers for rectangular 
matrices until Chapter 3. (Tsuyoshi Koyama) 
Page 95, question 2.11. Assume i is not equal to j. (Tsuyoshi Koyama) 

Page 95, question 2.13, parts 2 and 3: The intent is to suppose that you have already done 
Gaussian elimination on A to get its L and U factors, so that solving Ax=b is fast (costs just 
O(n^2)), and then to exploit this to solve By=c in O(n^2), rather than O(n^3), which is what 
Gaussian elimination on B would cost. (Matt Podolsky) 

Page 98, question 2.18 part 1. Assume that the first k steps of Gaussian elimination without 
pivoting succeed, i.e. do not try to divide by 0. (Tsuyoshi Koyama) 

Page 114, Table in the middle of the page: "sigma_{k+1}/sigma_k" in the heading of column 2 
should be "sigma_{k+1}/sigma_1". 

Page 118, line 2: There is an extra closing parenthesis at the end of the line. (Matt Podolsky) 

Page 119, last line: tilde_u should equal x + sign(x_1)*norm(x)*e_1, not x + sign(x_1)*e_1. (Matt 
Podolsky) 
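
The corrected Householder vector can be checked on a small example. The following is a minimal pure-Python sketch, not taken from the book; the function names are illustrative, sign(0) is taken as +1 by assumption, and the reflection is written as P = I - 2*u*u^T/(u^T*u), so u does not need to be normalized first.

```python
import math

def householder_vector(x):
    """Return u = x + sign(x[0]) * ||x||_2 * e_1 (sign(0) := +1, an assumption)."""
    norm_x = math.sqrt(sum(v * v for v in x))
    sign = 1.0 if x[0] >= 0 else -1.0
    u = list(x)
    u[0] += sign * norm_x
    return u

def reflect(u, x):
    """Apply P = I - 2 u u^T / (u^T u) to the vector x."""
    utu = sum(v * v for v in u)
    utx = sum(a * b for a, b in zip(u, x))
    return [xi - 2.0 * utx / utu * ui for xi, ui in zip(x, u)]

x = [3.0, 4.0, 0.0]
u = householder_vector(x)
y = reflect(u, x)
print(y)  # all entries after the first vanish, up to roundoff
```

With the uncorrected vector x + sign(x_1)*e_1, the same reflection leaves a nonzero second entry for this x, which is why the factor norm(x) is needed.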

Page 122: The displayed matrix R(i,j,theta) differs from the identity matrix only in rows and 
columns i and j, whose entries are cos(theta) and +-sin(theta). (Matt Podolsky) 
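
The structure described here can be sketched in a few lines of pure Python. This is an illustration, not the book's code: the function names are hypothetical, indices are 0-based, and the placement of the minus sign (entry (i,j) is -sin(theta), entry (j,i) is +sin(theta)) is an assumed convention, since the erratum only says "+-sin(theta)".

```python
import math

def givens(n, i, j, theta):
    """n-by-n identity changed only in rows and columns i and j (0-based)."""
    R = [[1.0 if r == k else 0.0 for k in range(n)] for r in range(n)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    R[i][i] = cos_t
    R[j][j] = cos_t
    R[i][j] = -sin_t   # assumed sign convention
    R[j][i] = sin_t
    return R

def matvec(R, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in R]

x = [1.0, 3.0, 2.0, 4.0]
i, j = 1, 3
# Under the assumed convention, this angle zeroes entry j of R*x.
theta = math.atan2(-x[j], x[i])
y = matvec(givens(len(x), i, j, theta), x)
print(y)  # entries 0 and 2 unchanged, entry 3 is zero up to roundoff
```
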

Page 127, 4th paragraph: "b = A^{-1} * x" should be "x = A^{-1} * b", and "b = A^{+} * x" 
should be "x = A^{+} * b". (Guenter Gramlich) 

Page 127, Definition 3.2: "A^+ = V^T Sigma^+ U" should be "A^+ = V Sigma^+ U^T". 

Page 145, Definition 4.4: "R^n" should be "R^n or C^n". 

Page 191, Question 4.15: In the fourth line of matlab, " diag((1.5*ones(1,5)).\verb+^+(0:4)) + " 
should be " diag((1.5*ones(1,5)).^(0:4)) + ". The "\verb+^+" is a latex error. 

Page 192, Question 4.16, line 23: Numterm(2,1) should be NumTerms(2, 1). (William De Meo) 
Page 259, proof of Corollary 5.4: The last displayed equation should be "(d/dt) T(-t) = -(d/dt) T at 
-t = + pi_O(F(T))*T - T*pi_O(F(T)) at -t"; the first term on the right has the wrong sign. (Emile 
Sahouria) 

Page 280, proof of Lemma 6.5: In the 5th line of the displayed equation, " = max_i | lambda_i | + 
eps " should be " <= max_i | lambda_i | + eps " 


Page 304, 5th line of text: "q_j yields q_j Aq_j" should be "q_j^T yields q_j^T A q_j". 